(57) After three kinds of data, i.e., a keyword frequency-of-appearance (103), a document length (105),
and a keyword weight (107), are produced, a document profile vector (111) and a keyword profile vector (109)
are calculated. Then, by independently performing the weighted principal component analysis (112, 114)
considering the document length and the keyword weight, a document feature vector and a keyword feature
vector are obtained. Then, documents and keywords having higher similarity to the feature vectors calculated
with reference to the retrieval and extracting conditions are obtained and displayed.
Description

[0001] The present invention relates to a similar document retrieving apparatus which designates, as a typical
example, one or plural document data from a document database (i.e., a set or assembly of document data) which is
electronically stored as strings of character codes and is machine treatable or processible, or designates an
arbitrary sentence not involved in this database. The similar document retrieving apparatus retrieves one or more
documents similar to the designated typical example from the document database. Furthermore, the present invention
relates to a relevant keyword extracting apparatus which extracts one or more keywords relating to the "typical
example" from the document database. The relevant keyword extracting apparatus presents the extracted keywords to
the users of this document database as an aid for comprehension of the retrieved document contents, or as a hint
for preferable retrieval conditions (i.e., queries). In particular, the present invention makes it possible to
perform highly accurate document retrieval and keyword extraction.

[0002] Due to the recent spread of word processors and personal computers as well as large-scale and low-cost
storage media, such as CD-ROM and DVD-ROM, and the development of networks, such as Ethernet, all of the documents
or most of the character information can be practically stored as strings of character codes. A full text database
is now widely used.

[0003] According to a conventional full text database, in retrieving the documents, a Boolean expression of keywords
is generally designated as a query. It is checked whether or not a designated keyword appears in the documents, and
a document set satisfying the Boolean expression is obtained as a retrieval result.
[0004] Recently, a so-called document ranking technique has been introduced and practically used. According to this
ranking technique, the relevancy between each document in the obtained document set and the retrieval conditions
(i.e., queries) is obtained according to a so-called "tf-idf" method or the like. Then, the documents are ranked in
order of relevancy and are presented to the users.
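
The ranking step can be illustrated with a minimal sketch. It assumes one common tf-idf variant (raw term frequency multiplied by log(N / document frequency)); the text above does not say which variant is meant, and the function name `tfidf_rank` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rank(docs, query):
    """Rank tokenized documents against query keywords by a simple tf-idf score.

    Assumption: score = sum over query keywords of tf * log(N / df).
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    terms = set(query)
    # Document frequency: in how many documents each query keyword appears.
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    scored = []
    for idx, c in enumerate(counts):
        score = sum(c[t] * math.log(n_docs / df[t]) for t in terms if df[t] and c[t])
        scored.append((score, idx))
    # Highest score first; ties are broken by document index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored]

docs = [
    ["network", "storage", "network"],
    ["storage", "media"],
    ["network", "media", "ethernet"],
]
ranking = tfidf_rank(docs, ["network"])
```

Document 0 mentions the query keyword twice and is ranked first; document 1 does not contain it at all and is ranked last.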

[0005] However, this conventional full text database system is disadvantageous in the following points.

(1) When no appropriate keywords come to mind or are found, it is difficult to designate appropriate retrieval
conditions (i.e., queries).

(2) Describing a complicated Boolean expression requires a high skill and enough time.

(3) Because of the synonymy problem, there is a possibility that an intended document cannot be retrieved.

[0006] In view of these problems, research and development for a similar document retrieving system or a relevant
keyword extracting system has recently become vigorous so as to effectively retrieve documents similar to a
designated typical example or to extract and display relevant keywords relating to the designated documents or
word set.

[0007] United States Patent No. 4,839,853 discloses a conventional method for retrieving similar documents, which
is called the LSI (latent semantic indexing) method.

[0008] To make clear the difference between the present invention and the LSI method, the gist of the LSI method
will be explained.

[0009] When applied to a document database D containing N document data, the LSI method mechanically extracts
keywords, i.e., characteristic words representing each document, and records the frequency of occurrence (i.e., the
number of times) of each keyword appearing in each document. It is now assumed that a total of M kinds of keywords
are extracted from the document database D.

[0010] Extracted keywords are aligned according to a dictionary order or an appropriate order. Then, the
frequency-of-appearance f_dt of the t-th keyword in the d-th document is expressed as the element in the d-th line
and t-th row of a matrix F (here a "line" is a horizontal row of the matrix and a "row" is a vertical column).
Then, through a matrix operation called incomplete singular value decomposition, this matrix F is approximately
decomposed into a product of a matrix U of N lines and K rows having a document-side singular vector in each row,
a diagonal matrix Λ of K lines and K rows having singular values aligned as diagonal elements, and a matrix V of
K lines and M rows having a keyword-side singular vector in each line. In this case, K is sufficiently small
compared with N and M. As a result, the original frequency-of-occurrence matrix F can be approximately expressed
by a lower-rank matrix.
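
As a rough illustration of this decomposition, the sketch below truncates a full singular value decomposition to its K largest singular values using NumPy. The toy matrix `F`, the value K = 2 and the helper name `truncated_svd` are assumptions made for the example; a production system on a large database would more likely use a sparse, iterative solver.

```python
import numpy as np

def truncated_svd(F, K):
    """Return (U, s, V) with F approximately equal to U @ diag(s) @ V.

    U has shape (N, K), s has shape (K,), V has shape (K, M),
    matching the N x K, K x K and K x M factors described in the text.
    """
    U_full, s_full, V_full = np.linalg.svd(F, full_matrices=False)
    # Singular values are returned in descending order, so keep the first K.
    return U_full[:, :K], s_full[:K], V_full[:K, :]

# Toy frequency matrix: 5 documents (lines) by 4 keywords (rows).
F = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
    [1.0, 0.0, 0.0, 1.0],
])
U, s, V = truncated_svd(F, K=2)
F_approx = U @ np.diag(s) @ V   # rank-2 approximation of F
```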
[0011] A total of K document-side singular vectors are obtained through the above decomposition. Thus, a feature
vector U_d of the document d is obtained as a K-dimensional vector containing the respective d-th components of the
obtained K document-side singular vectors. Similarly, a total of K keyword-side singular vectors are obtained
through the above decomposition. Thus, a feature vector V_t of the keyword t is obtained as a K-dimensional vector
containing the respective t-th components of the obtained K keyword-side singular vectors.

[0012] Subsequently, calculation of similarity and relevancy is performed according to the following three
procedures so as to obtain documents and keywords having higher similarities and relevancies, thereby realizing
the similar document retrieval and the relevant keyword extraction.

(1) The similarity between two documents a and b is obtained by calculating an inner product U_a·U_b between the
document feature vectors U_a and U_b of these documents a and b.

(2) The relevancy between two keywords K_a and K_b is obtained by calculating an inner product V_a·V_b between the
two keyword feature vectors V_a and V_b of these keywords K_a and K_b.

(3) A keyword extraction result from an arbitrary (external) document is represented by an M-dimensional vector E
having components representing frequency-of-occurrence values of the M keywords appearing in this document. A
retrieval condition document feature vector U_e corresponding to this external document is represented by the
expression U_e = Λ^-1 V E. Then, the similarity between this external document and the document d in the document
database is obtained as a product U_d·U_e.

The above-described procedures are a fundamental framework of the LSI method.
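
The three procedures can be sketched as follows, assuming NumPy and a toy frequency matrix. The function names are hypothetical. The sketch follows the formulas as stated (plain inner products and U_e = Λ^-1 V E) and is not taken from the cited patent's implementation.

```python
import numpy as np

# Toy frequency matrix: 5 documents by 4 keywords (illustrative values only).
F = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
    [1.0, 0.0, 0.0, 1.0],
])
K = 2
U_full, s_full, V_full = np.linalg.svd(F, full_matrices=False)
U, s, V = U_full[:, :K], s_full[:K], V_full[:K, :]

def document_similarity(a, b):
    """Procedure (1): inner product of document feature vectors U_a and U_b."""
    return float(U[a] @ U[b])

def keyword_relevancy(ta, tb):
    """Procedure (2): inner product of keyword feature vectors V_a and V_b."""
    return float(V[:, ta] @ V[:, tb])

def fold_in(E):
    """Procedure (3): U_e = inverse(Lambda) V E for an M-dimensional vector E."""
    return (V @ E) / s

# Folding in a document that is already in the database reproduces
# its own feature vector, which is a convenient sanity check.
U_e = fold_in(F[0])
scores = U @ U_e   # similarity of the external document to every document d
```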

[0013] However, if the keyword frequency-of-appearance f_dt is directly used in the application of the LSI method to
an actual document database, the feature vector obtained will be somewhat deviated due to the presence of longer
documents or frequently appearing keywords. This will significantly worsen the accuracy of similar document
retrieval.
[0014] Hence, the LTC method conventionally used in the relevance ranking of a document retrieving system, or a
comparable method, is introduced to convert or normalize the keyword frequency-of-occurrence. A
frequency-of-occurrence matrix F is created so as to contain the normalized frequency-of-occurrence values. Then,
the incomplete singular value decomposition is performed to obtain a feature vector.

[0015] For example, according to the LTC conversion, the following equation is used to calculate a
frequency-of-occurrence LTC(f_dt) based on the actual frequency-of-occurrence f_dt and the number n_t of documents
containing the keyword t. A matrix containing this value is subjected to the incomplete singular value
decomposition.



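\[
\mathrm{LTC}(f_{dt}) = \frac{(1 + \log_2 f_{dt})\,\log_2\!\left(1 + \frac{N}{n_t}\right)}
{\sqrt{\sum_{j}\left[(1 + \log_2 f_{dj})\,\log_2\!\left(1 + \frac{N}{n_j}\right)\right]^{2}}}
\tag{1}
\]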
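
A minimal sketch of this conversion is given below. It assumes the denominator is the square root of the sum of squared weighted terms over all keywords of the document (cosine normalization), and it assumes a zero frequency maps to zero, since the logarithm is undefined there. Neither detail is spelled out in the text, and the function name `ltc_matrix` is hypothetical.

```python
import math

def ltc_matrix(F):
    """Apply the LTC conversion to a frequency matrix F (list of document rows)."""
    n_docs = len(F)
    n_terms = len(F[0])
    # n_t: number of documents that contain keyword t.
    n_t = [sum(1 for d in range(n_docs) if F[d][t] > 0) for t in range(n_terms)]
    converted = []
    for row in F:
        raw = [
            (1 + math.log2(f)) * math.log2(1 + n_docs / n_t[t]) if f > 0 else 0.0
            for t, f in enumerate(row)
        ]
        norm = math.sqrt(sum(x * x for x in raw))
        converted.append([x / norm if norm else 0.0 for x in raw])
    return converted

F = [[4, 1, 0], [1, 2, 2], [0, 1, 8]]
ltc = ltc_matrix(F)
row_square_sums = [sum(x * x for x in row) for row in ltc]
```

Under these assumptions each document's converted values have a square-sum of 1.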



[0016] However, the conversion of keyword frequency-of-occurrence by the conventional LSI method causes the 
following problems. 

[0017] Analysis according to the LSI method is performed on the assumption that a d-th line of the matrix F
represents the feature of document d and a t-th row of the matrix F represents the feature of keyword t. In this
conversion, the square-sum of the elements of each line can be normalized to 1. However, the square-sum of the
elements of each row cannot be normalized to 1. Accordingly, the performed conversion becomes asymmetric between the
document side and the keyword side. Thus, the simple conversion using the above equation 1 cannot normalize both the
document side and the keyword side to 1. Such asymmetry is also found in conversions using other equations.

[0018] Furthermore, when a logarithmic function or other nonlinear function is used in the conversion as shown in
the equation 1, the feature of a certain document d is not identical with the feature of a document d' consisting of
two successive copies of the document d. Therefore, the similarity between the document d and the document d' is
not equal to 1. Similarly, when two keywords t_1 and t_2 are identical in the frequency-of-occurrence as well as in
the meaning, a frequency-of-occurrence matrix obtained on the assumption that the two keywords t_1 and t_2 are the
same does not agree with the original frequency-of-occurrence matrix.
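
Both effects, the asymmetry between lines and rows and the instability under document duplication, can be checked numerically. The sketch below reuses the same assumed reading of equation 1 (cosine normalization, zero frequency mapped to zero) on a hypothetical toy matrix to which a doubled copy d' of document 0 has been appended.

```python
import math

def ltc_matrix(F):
    """LTC conversion under the assumptions stated in the lead-in."""
    n_docs, n_terms = len(F), len(F[0])
    n_t = [sum(1 for d in range(n_docs) if F[d][t] > 0) for t in range(n_terms)]
    out = []
    for row in F:
        raw = [
            (1 + math.log2(f)) * math.log2(1 + n_docs / n_t[t]) if f > 0 else 0.0
            for t, f in enumerate(row)
        ]
        norm = math.sqrt(sum(x * x for x in raw))
        out.append([x / norm if norm else 0.0 for x in raw])
    return out

F = [[4, 1, 0], [1, 2, 2], [0, 1, 8]]
doubled = [2 * f for f in F[0]]          # d': document 0 written twice
ltc = ltc_matrix(F + [doubled])

# Document side: every line has square-sum 1.
line_square_sums = [sum(x * x for x in row) for row in ltc]
# Keyword side: the rows (columns of the list-of-lists) do not.
row_square_sums = [sum(ltc[d][t] ** 2 for d in range(len(ltc))) for t in range(3)]
# Similarity between d and d' as the inner product of their converted vectors.
similarity_d_dprime = sum(a * b for a, b in zip(ltc[0], ltc[3]))
```

Because 1 + log2(2f) is not proportional to 1 + log2(f), the converted vectors of d and d' point in slightly different directions and their inner product falls below 1.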
[0019] The above-described asymmetry, or the above-described non-stability of the document similarity or the keyword
relevancy under the merging of documents or keywords, causes the following phenomena when a large-scale document
database is processed.

(1) In the retrieving and extracting operation at the non-normalized side (i.e., keyword side in many cases), items
having large norms (i.e., square-sum of elements of F) are chiefly retrieved or extracted.

(2) When a document retrieval is performed with a keyword set, only certain keywords have very strong effects and
the others are almost neglected.

[0020] Consequently, the obtained retrieval result will be far from the intent of the retrieval. Thus, the accuracy
of retrieval is greatly worsened.

[0021] To solve the above-described problems of the prior art, the present invention has an object to provide a
similar document retrieving apparatus and a relevant keyword extracting apparatus which can normalize both the
document side and the keyword side and maintain higher retrieving accuracy.
[0022] To accomplish the above and other related objects, the present invention provides a first similar document
retrieving apparatus applicable to a document database D which stores N document data containing a total of M kinds
of keywords and is machine processible, for designating a retrieval condition (i.e., query) consisting of a document
group including at least one document x_1, ..., x_r selected from the document database D and for retrieving
documents similar to the document group of the retrieval condition from the document database D. The first similar
document retrieving apparatus of this invention comprises: keyword frequency-of-occurrence calculating means for
calculating a keyword frequency-of-occurrence data F which represents a frequency-of-occurrence f_dt of each keyword
t appearing in each document d stored in the document database D; document length calculating means for calculating
a document length data L which represents a length l_d of each document d; keyword weight calculating means for
calculating a keyword weight data W which represents a weight w_t of each keyword t of the M kinds of keywords
appearing in the document database D; document profile vector producing means for producing an M-dimensional
document profile vector P_d having components respectively representing a relative frequency-of-occurrence p_dt of
each keyword t in the concerned document d; document principal component analyzing means for performing a principal
component analysis on a document profile vector group of a document group in the document database D and for
obtaining a predefined (K)-dimensional document feature vector U_d corresponding to the document profile vector P_d
for each document d; and similar document retrieving means for receiving the retrieval condition consisting of the
document group including at least one document x_1, ..., x_r selected from the document database D, calculating a
similarity between each document d and the retrieval condition based on a document feature vector of the received
document group and the document feature vector of each document d in the document database D, and outputting a
designated number of similar documents in order of the calculated similarity.

[0023] Furthermore, the present invention provides a second similar document retrieving apparatus applicable to a
document database D which stores N document data containing a total of M kinds of keywords and is machine
processible, for designating a retrieval condition (i.e., query) consisting of a keyword group including at least
one keyword y_1, ..., y_s selected from the document database D and for retrieving documents relevant to the
retrieval condition from the document database D. In addition to the above-described keyword frequency-of-occurrence
calculating means, the document length calculating means, the keyword weight calculating means, and the document
profile vector producing means, the second similar document retrieving apparatus of this invention comprises:
keyword profile vector calculating means for calculating an N-dimensional keyword profile vector Q_t having
components respectively representing a relative frequency-of-occurrence q_dt of the concerned keyword t in each
document d; document principal component analyzing means for performing a principal component analysis on a document
profile vector group of a document group in the document database D and for obtaining a predefined (K)-dimensional
document feature vector U_d corresponding to the document profile vector P_d for each document d; keyword principal
component analyzing means for performing a principal component analysis on a keyword profile vector group of a
keyword group in the document database D and for obtaining a predefined (K)-dimensional keyword feature vector V_t
corresponding to the keyword profile vector Q_t for each keyword t, the keyword feature vector having the same
dimension as that of the document feature vector, as well as for obtaining a keyword contribution factor (i.e.,
eigenvalue of a correlation matrix) θ_j of each dimension j; retrieval condition feature vector calculating means
for receiving the retrieval condition (i.e., query) consisting of the keyword group including at least one keyword
y_1, ..., y_s, and for calculating a retrieval condition feature vector corresponding to the retrieval condition
(i.e., query) based on the keyword weight data of the received keyword group, the keyword feature vector and the
keyword contribution factor; and similar document retrieving means for calculating a similarity between each
document d and the retrieval condition based on the calculated retrieval condition feature vector and a document
feature vector of each document d, and outputting a designated number of similar documents in order of the
calculated similarity.
[0024] Furthermore, the present invention provides a first relevant keyword extracting apparatus applicable to a
document database D which stores N document data containing a total of M kinds of keywords and is machine
processible, for designating an extracting condition consisting of a keyword group including at least one keyword
y_1, ..., y_s selected from the document database D and for extracting keywords relevant to the keyword group of the
extracting condition from the document database D. In addition to the above-described keyword
frequency-of-occurrence calculating means, the document length calculating means, and the keyword weight calculating
means, the first relevant keyword extracting apparatus of this invention comprises: keyword profile vector
calculating means for calculating an N-dimensional keyword profile vector Q_t having components respectively
representing a relative frequency-of-occurrence q_dt of the concerned keyword t in each document d; keyword
principal component analyzing means for performing a principal component analysis on a keyword profile vector group
of a keyword group in the document database D and for obtaining a predefined (K)-dimensional keyword feature vector
V_t corresponding to the keyword profile vector Q_t for each keyword t; and relevant keyword extracting means for
receiving the extracting condition consisting of the keyword group including at least one keyword y_1, ..., y_s
selected from the document database D, calculating a relevancy between each keyword t and the extracting condition
based on a keyword feature vector of the received keyword group and the keyword feature vector of each keyword t in
the document database D, and outputting a designated number of relevant keywords in order of the calculated
relevancy.

[0025] Furthermore, the present invention provides a second relevant keyword extracting apparatus applicable to a
document database D which stores N document data containing a total of M kinds of keywords and is machine
processible, for designating an extracting condition consisting of a document group including at least one document
x_1, ..., x_r selected from the document database D and for extracting keywords relevant to the document group of
the extracting condition from the document database D. In addition to the above-described keyword
frequency-of-occurrence calculating means, the document length calculating means, the keyword weight calculating
means, the document profile vector producing means, and the keyword profile vector calculating means, the second
relevant keyword extracting apparatus of this invention comprises: document principal component analyzing means for
performing a principal component analysis on a document profile vector group of a document group in the document
database D and for obtaining a predefined (K)-dimensional document feature vector U_d corresponding to the document
profile vector P_d for each document d as well as for obtaining a document contribution factor (i.e., eigenvalue of
a correlation matrix) of each dimension j; keyword principal component analyzing means for performing a principal
component analysis on a keyword profile vector group of a keyword group in the document database D and for obtaining
a predefined (K)-dimensional keyword feature vector V_t corresponding to the keyword profile vector Q_t for each
keyword t, the keyword feature vector having the same dimension as that of the document feature vector; extracting
condition feature vector calculating means for receiving the extracting condition consisting of the document group
including at least one document x_1, ..., x_r, and for calculating an extracting condition feature vector
corresponding to the extracting condition based on the document length data of the received document group, the
document feature vector and the document contribution factor; and relevant keyword extracting means for calculating
a relevancy between each keyword t and the extracting condition based on the calculated extracting condition feature
vector and a keyword feature vector of each keyword t, and outputting a designated number of relevant keywords in
order of the calculated relevancy.
[0026] According to the similar document retrieving apparatus and the relevant keyword extracting apparatus of the
present invention, the frequency-of-occurrence of each keyword in a concerned document is expressed as a document
profile vector and the frequency-of-appearance of a concerned keyword in each document as a keyword profile vector.
A weighted principal component analysis considering the document length and the keyword weight is independently
performed to obtain both a document feature vector and a keyword feature vector.

[0027] In this case, the vector representation in the document profile and in the keyword profile is not dependent
on the conversion (i.e., normalization) of frequency-of-occurrence. The document length data and the keyword weight
data, relevant to the conversion of frequency-of-occurrence, are indirectly reflected as the weight in the principal
component analysis. Thus, it becomes possible to perform the normalization without depending on the conversion of
frequency-of-occurrence.
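
To illustrate why a profile representation does not depend on a frequency conversion, the sketch below assumes the simplest possible reading of "relative frequency-of-occurrence": each frequency divided by the total over the document (for a document profile) or over the keyword (for a keyword profile). This is an assumption for illustration only. The exact definitions used by the invention, and how the keyword weight and document length enter them, are not given in this part of the description, and the function names are hypothetical.

```python
def document_profile(freq_line):
    """Assumed P_d: frequencies of one document divided by their total."""
    total = sum(freq_line)
    return [f / total for f in freq_line] if total else [0.0] * len(freq_line)

def keyword_profile(F, t):
    """Assumed Q_t: frequencies of keyword t in each document divided by their total."""
    column = [line[t] for line in F]
    total = sum(column)
    return [f / total for f in column] if total else [0.0] * len(column)

F = [[4, 1, 0], [1, 2, 2], [0, 1, 8]]
d = F[0]
d_doubled = [2 * f for f in d]   # the same document written twice

p_d = document_profile(d)
p_d_doubled = document_profile(d_doubled)
q_1 = keyword_profile(F, 1)
```

With this reading, a document and its doubled copy share the same profile, so the instability seen with a logarithmic conversion does not arise.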

[0028] As a result, the present invention makes it possible to provide the similar document retrieving apparatus and
the relevant keyword extracting apparatus which are highly accurate.

[0029] The above and other objects, features and advantages of the present invention will become more apparent
from the following detailed description which is to be read in conjunction with the accompanying drawings, in which:

Fig. 1 is a block diagram showing an overall arrangement of a similar document retrieving and relevant keyword
extracting system in accordance with a preferred embodiment of the present invention;

Fig. 2 is a view showing an example of a newspaper full text database;

Fig. 3 is a block diagram showing an internal arrangement of a keyword extracting and counting section in
accordance with the preferred embodiment of the present invention;

Fig. 4 is a conceptual diagram showing a practical example of the keyword extracting and counting processing in
accordance with the preferred embodiment of the present invention;

Fig. 5 is a flowchart showing the procedure for creating document length data in accordance with the preferred
embodiment of the present invention;

Fig. 6 is a flowchart showing the procedure for creating keyword weight data in accordance with the preferred
embodiment of the present invention;

Fig. 7 is a flowchart showing the procedure for creating document profile vector data in accordance with the
preferred embodiment of the present invention;

Fig. 8 is a flowchart showing the procedure for creating keyword profile vector data in accordance with the
preferred embodiment of the present invention;

Fig. 9 is a flowchart showing the procedure for executing a principal component analysis on the document profile
vector data in accordance with the preferred embodiment of the present invention;

Fig. 10 is a flowchart showing the procedure for executing a principal component analysis on the keyword profile
vector data in accordance with the preferred embodiment of the present invention;

Fig. 11 is a flowchart showing the procedure for calculating a retrieval condition feature vector in accordance
with the preferred embodiment of the present invention; and

Fig. 12 is a flowchart showing the procedure for calculating an extracting condition feature vector in accordance
with the preferred embodiment of the present invention.

[0030] Hereinafter, a preferred embodiment of the present invention will be explained with reference to the attached
drawings. Identical parts are denoted by the same reference numerals throughout the views.

[0031] Fig. 1 is a block diagram showing an overall arrangement of a similar document retrieving and relevant
keyword extracting system which is realized as a function of a digital electronic computer and acts as a similar
document retrieving apparatus and a relevant keyword extracting apparatus in accordance with a preferred embodiment
of the present invention.

[0032] This system comprises a newspaper full text database 101 storing various news documents together with
their document numbers, headlines, and text bodies. Each news document, i.e., individual newspaper article, serves
as a retrieval unit. A keyword extracting and counting section 102 scans character strings in the text body of each
newspaper article stored in the newspaper full text database 101 to extract keywords appearing in each newspaper
article and to count each keyword as the frequency of occurrence of this word. A keyword frequency-of-occurrence
file 103 stores an extraction and counting result obtained by the keyword extracting and counting section 102. A
document length calculating section 104 calculates a document length of each newspaper article according to a
document length calculation mode given as an external parameter, based either on the number of characters in the
character strings contained in the body of each newspaper article, by accessing the newspaper full text database
101, or on the total number of keywords appearing in each newspaper article, by accessing the keyword
frequency-of-occurrence file 103. A document length file 105 stores the calculation result obtained by the document
length calculating section 104. A keyword weight calculating section 106 calculates a weighted value of each keyword
with reference to the keyword frequency-of-occurrence file 103. A keyword weight file 107 stores the calculation
result obtained by the keyword weight calculating section 106. A keyword profile vector producing section 108
produces a keyword profile vector representing the feature of each keyword based on the keyword
frequency-of-occurrence file 103 and the document length file 105. A keyword profile vector file 109 stores keyword
profile vector groups produced by the keyword profile vector producing section 108. A document profile vector
producing section 110 produces a document profile vector representing the feature of each document based on the
keyword frequency-of-occurrence file 103 and the keyword weight file 107. A document profile vector file 111 stores
document profile vector groups produced by the document profile vector producing section 110.
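
The counting and document length steps can be sketched as follows. The sketch assumes whitespace tokenization against a fixed keyword vocabulary, which is a simplification chosen for the example; the actual keyword extracting and counting section is described with reference to Figs. 3 and 4. The function names and the mode strings "characters" and "keywords" are hypothetical.

```python
from collections import Counter

def count_keywords(text, vocabulary):
    """Count how often each vocabulary keyword occurs in the article body."""
    tokens = text.lower().split()
    return Counter(tok for tok in tokens if tok in vocabulary)

def document_length(text, keyword_counts, mode):
    """Document length under one of the two calculation modes in the text."""
    if mode == "characters":
        return len(text)                      # number of characters in the body
    if mode == "keywords":
        return sum(keyword_counts.values())   # total number of keyword occurrences
    raise ValueError(f"unknown document length mode: {mode}")

article = "network storage network media"
vocabulary = {"network", "storage"}
counts = count_keywords(article, vocabulary)
length_by_keywords = document_length(article, counts, "keywords")
length_by_characters = document_length(article, counts, "characters")
```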

[0033] A keyword principal component analyzing section 112 performs K-dimensional weighted principal component
analysis of the keyword profile vector file 109 with reference to the keyword frequency-of-occurrence file 103, the
document length file 105, and the keyword weight file 107, wherein K is a predetermined external parameter. Through
the K-dimensional weighted principal component analysis, the keyword principal component analyzing section 112
obtains a total of K principal axes (i.e., eigenvectors of a correlation matrix) and a contribution factor of each
principal axis (i.e., eigenvalue of the correlation matrix). And, the keyword principal component analyzing section
112 obtains a feature vector (i.e., components or projection onto the K principal axes) of each keyword. A keyword
principal component analysis result file 113 stores the analysis result, i.e., the feature vector of each keyword
and the contribution factor of each principal axis, obtained by the keyword principal component analyzing section
112.

[0034] A document principal component analyzing section 114 performs K-dimensional weighted principal component
analysis of the document profile vector file 111 with reference to the keyword frequency-of-occurrence file 103,
the document length file 105, and the keyword weight file 107. Through the K-dimensional weighted principal
component analysis, the document principal component analyzing section 114 obtains a total of K principal axes and a
contribution factor of each principal axis. And, the document principal component analyzing section 114 obtains a
feature vector (i.e., components or projection onto the K principal axes) of each document. A document principal
component analysis result file 115 stores the analysis result, i.e., the feature vector of each document and the
contribution factor of each principal axis, obtained by the document principal component analyzing section 114.
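
The general shape of such an analysis can be sketched with a generic weighted principal component analysis in NumPy. This is a textbook-style sketch under stated assumptions, not the invention's exact formulation: it weights each profile vector by a single scalar (for example a document length or a keyword weight), uses an uncentred weighted second-moment matrix in place of the "correlation matrix", and ignores any weighting of the vector components themselves. The function name `weighted_pca` and the toy data are hypothetical.

```python
import numpy as np

def weighted_pca(profiles, weights, K):
    """Generic weighted PCA.

    profiles: array of shape (n_items, dim), one profile vector per item.
    weights:  array of shape (n_items,), one non-negative weight per item.
    Returns (axes, contributions, features):
      axes          (dim, K)      principal axes (eigenvectors)
      contributions (K,)          eigenvalue of each axis, largest first
      features      (n_items, K)  projection of each profile onto the axes
    """
    profiles = np.asarray(profiles, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted, uncentred second-moment matrix (assumed stand-in for the
    # "correlation matrix" mentioned in the text).
    C = (profiles * weights[:, None]).T @ profiles / weights.sum()
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:K]       # indices of the K largest
    axes = eigvecs[:, order]
    contributions = eigvals[order]
    features = profiles @ axes
    return axes, contributions, features

profiles = np.array([
    [0.8, 0.2, 0.0],
    [0.2, 0.4, 0.4],
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
])
weights = np.array([5.0, 5.0, 9.0, 2.0])
axes, contributions, features = weighted_pca(profiles, weights, K=2)
```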

[0035] A condition input section 116 allows an operator to input similar article retrieving and relevant keyword
extracting conditions for retrieving the newspaper full text database 101 in the form of either a string of document
numbers or a string of keywords. When a string of document numbers is entered through the condition input section
116, a retrieval condition feature vector calculating section 117 calculates a retrieval condition feature vector
corresponding to the entered string of document numbers with reference to a corresponding document feature vector in
the document principal component analysis result file 115. Furthermore, when a string of keywords is entered through
the condition input section 116, the retrieval condition feature vector calculating section 117 calculates a
retrieval condition feature vector corresponding to the entered string of keywords with reference to the keyword
weight file 107 and the keyword principal component analysis result file 113.

[0036] When a string of document numbers is entered through the condition input section 116, an extracting condition
feature vector calculating section 118 calculates an extracting condition feature vector corresponding to the
entered string of document numbers with reference to the document length file 105 and the document principal
component analysis result file 115. Furthermore, when a string of keywords is entered through the condition input
section 116, the extracting condition feature vector calculating section 118 calculates an extracting condition
feature vector corresponding to the entered string of keywords with reference to a corresponding keyword feature
vector in the keyword principal component analysis result file 113.

[0037] A first similar document retrieving section 119 calculates an inner product (a maximum value of the inner
product in the case where a plurality of documents are designated) between the retrieval condition feature vector
calculated by the retrieval condition feature vector calculating section 117 and each document feature vector in the
document principal component analysis result file 115. Then, the first similar document retrieving section 119
determines the document numbers of the documents having the 1st to R-th largest calculated inner product values (R
represents the number of acquired documents, which is a predetermined external parameter).

[0038] A second similar document retrieving section 120 calculates a distance (a minimum value of the distance in
the case where a plurality of documents are designated) between the retrieval condition feature vector calculated by
the retrieval condition feature vector calculating section 117 and each document feature vector in the document
principal component analysis result file 115. Then, the second similar document retrieving section 120 determines
the document numbers of the documents having the 1st to R-th smallest calculated distance values.

[0039] A first relevant keyword extracting section 121 calculates an inner product (a maximum value of the inner
product in the case where a plurality of keywords are designated) between the extracting condition feature vector
calculated by the extracting condition feature vector calculating section 118 and each keyword feature vector in the
keyword principal component analysis result file 113. Then, the first relevant keyword extracting section 121
determines the keywords having the 1st to S-th largest calculated inner product values (S represents the number of
acquired keywords, which is a predetermined external parameter).

[0040] A second relevant keyword extracting section 122 calculates a distance (the minimum value of the distance
when a plurality of keywords are designated) between the extracting condition feature vector calculated by the
extracting condition feature vector calculating section 118 and each keyword feature vector in the keyword principal
component analysis result file 113. Then, the second relevant keyword extracting section 122 determines the keywords
having the 1st to S-th smallest distance values.
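
The ranking performed by sections 119 to 122 can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, the distance is assumed to be Euclidean (the text above does not name a distance measure), and it is assumed that designating several documents or keywords yields several condition feature vectors, each candidate being scored by the maximum inner product or the minimum distance over them.

```python
import math


def inner_product(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Euclidean distance; an assumed choice, not stated in the text."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def top_by_inner_product(condition_vectors, feature_vectors, count):
    """Return the `count` item numbers with the largest inner product.

    feature_vectors maps an item number (document or keyword number)
    to its feature vector. With several condition vectors, each item
    is scored by its maximum inner product over them.
    """
    scores = {
        number: max(inner_product(c, v) for c in condition_vectors)
        for number, v in feature_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:count]


def top_by_distance(condition_vectors, feature_vectors, count):
    """Return the `count` item numbers with the smallest distance.

    With several condition vectors, each item is scored by its
    minimum distance over them.
    """
    scores = {
        number: min(euclidean_distance(c, v) for c in condition_vectors)
        for number, v in feature_vectors.items()
    }
    return sorted(scores, key=scores.get)[:count]
```

The same two functions serve both document retrieval (count = R) and keyword extraction (count = S), since only the feature vector table differs.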

[0041] A result display section 123 displays the document numbers and titles of the R retrieved similar articles as well
as the S extracted keywords together with their similarities, in order of the magnitude of similarity.

[0042] Next, an operation of the above-described similar document retrieving and relevant keyword extracting system
will be explained.

[0043] First, a schematic operation of this system will be explained. This system can retrieve newspaper articles by
accessing the newspaper full text database 101. When an operator designates document numbers, e.g., 2, 4, 9, ..., of
the articles similar to an intended article through the condition input section 116, this system retrieves articles similar
to the designated articles and extracts keywords relevant to the retrieved articles. The result display section 123 displays
the retrieved similar documents and the extracted relevant keywords. When an operator designates a string of keywords,
e.g., IT, internet, ..., through the condition input section 116, this system retrieves articles similar to the articles
containing the designated keywords and extracts keywords relevant to the retrieved articles. The result display section
123 displays the retrieved similar documents and the extracted relevant keywords.

[0044] The operation of this system consists of the following three stages (I), (II) and (III) which are performed in this
order.

(I) Prior to the similar document retrieving and relevant keyword extracting operation, the newspaper full text
database 101 is segmented into keywords and processed into three kinds of data: frequency of occurrence of each
keyword; length of each document; and weight of each keyword.

(II) Profile vector data, serving as an object of principal component analysis, is produced for each document
and each keyword. The document profile vector data is a vector whose components represent relative
frequency-of-occurrence values of respective keywords in a concerned document. The keyword profile vector data is a
vector whose components represent relative frequency-of-occurrence values of a concerned keyword in each document
in the document database.

[0045] Next, considering the document length and the keyword weight, the principal component analysis is performed
on the respective profile vector data to obtain feature vectors of respective documents and keywords (i.e., vectors having
featured components).

(III) When the conditions for the similar document retrieving and relevant keyword extracting operation are entered,
feature vectors of the similar document retrieval conditions and the relevant keyword extracting conditions are
calculated according to the type of entered conditions (i.e., document number or keyword) with reference to the
analysis result of the second stage (II) as well as the document length and the keyword weight. Similarity between
the retrieval condition feature vector and the document feature vector of each document and relevancy between
the extracting condition feature vector and the keyword feature vector of each keyword are calculated based on
the inner product or the distance between the vectors. Then, a designated number of similar documents and
relevant keywords are displayed together with their similarities.

[0046] Furthermore, the following parameters are set beforehand to execute the above operation.

* Document length calculating mode ("number of characters" or "number of words") 

[0047] This is a parameter determining the data serving as a basis for defining the document length of a concerned
newspaper article. When the "number of characters" is selected, the document length of a concerned newspaper article
is calculated based on the number of characters involved in the text body of the concerned article. When the "number
of words" is selected, the document length of a concerned newspaper article is calculated based on the total number
of keywords (including repetitive counting of the same keyword) obtained from the text body of the concerned article.

* Document length threshold (l0)

[0048] This is a parameter, being a nonnegative integer, determining a lower limit of the document length in calculating
the document length of a concerned newspaper article. When the number of characters or the total number of keywords
is smaller than the document length threshold l0, the document length of the concerned newspaper article is calculated
by using the document length threshold l0 instead of using an actual value.

* Document length root (δ)

[0049] This is a parameter, being a nonnegative integer, determining how the document length is derived from the data
serving as its basis in calculating the document length of a concerned newspaper article. The document
length of a concerned newspaper article is calculated as a δ-th root of the number of characters or the total number of
keywords. When the number of characters or the total number of keywords is smaller than the document length
threshold l0, the document length of the concerned newspaper article is calculated as a δ-th root of the document length
threshold l0.

* Keyword weight calculating mode ("1+log" or "log")

[0050] This is a first parameter determining a method of calculating the weight of a concerned keyword. When the
"1+log" mode is selected, the weight of a concerned keyword is calculated according to an expression 1+log2(N/n),
where N represents the number of all the documents and n represents the number of documents involving the
concerned keyword. When the "log" mode is selected, the weight of a concerned keyword is calculated according to an
expression log2((N+1)/n). When the keyword weight offset ε is not 0, the keyword weight is calculated based on corrected
values for the entire document number N and the keyword related document number n.

* Keyword weight offset (ε)

[0051] This is a second parameter determining a method of calculating the weight of a concerned keyword. The keyword
weight offset ε is added to each of the entire document number N and the keyword related document number n. In
calculating the keyword weight, N+ε and n+ε are used as representing the entire document number and the keyword
related document number. Thus, by using N+ε and n+ε, the keyword weight is calculated according to the
above-described keyword weight calculating mode.

* Analysis dimension (K) 

[0052] This is a parameter, being a positive integer, determining the dimension of analysis in performing the principal
component analysis. When the K-dimension is designated, a total of K (at maximum) sets of eigenvalues and
eigenvectors of the correlation matrix data are obtained to express the K-dimensional feature vectors for the document
and the keyword.

* Document similarity calculating mode ("inner product" or "distance") 

[0053] This is a parameter designating either the first similar document retrieving section 119 or the second similar
document retrieving section 120 as the source of the similar document retrieval result to be displayed on the result
display section 123. When the "inner product" mode is selected, the result display section 123 displays the retrieval
result obtained by the first similar document retrieving section 119. When the "distance" mode is selected, the result
display section 123 displays the retrieval result obtained by the second similar document retrieving section 120.

* Keyword relevancy calculating mode ("inner product" or "distance") 

[0054] This is a parameter designating either the first relevant keyword extracting section 121 or the second relevant
keyword extracting section 122 as the source of the relevant keyword extraction result to be displayed on the result
display section 123. When the "inner product" mode is selected, the result display section 123 displays the extraction
result obtained by the first relevant keyword extracting section 121. When the "distance" mode is selected, the result
display section 123 displays the extraction result obtained by the second relevant keyword extracting section 122.

* Displayed similar document number (α)

[0055] This is a parameter determining the number of documents to be displayed as the result of similar document
retrieval. When the displayed similar document number α is designated, a total of α documents are displayed in order
of the magnitude of similarity.

* Displayed relevant keyword number (β)

[0056] This is a parameter determining the number of keywords to be displayed as the result of relevant keyword
extraction. When the displayed relevant keyword number β is designated, a total of β keywords are displayed in order
of the magnitude of relevancy.

[0057] After the setting of the above-described parameters is finished, the stages (I) and (II) are performed successively
based on the preset parameters to analyze the newspaper full text database 101, thereby accomplishing the
preparation for the similar document retrieval and relevant keyword extraction. Thereafter, when the conditions
for the similar document retrieving and relevant keyword extracting operation are entered through the condition input
section 116, the stage (III) is performed based on the preset parameters with reference to the analysis result obtained
in the stages (I) and (II) to obtain the similar documents and relevant keywords. The obtained similar documents and
relevant keywords are displayed on the result display section 123. In a case where the similar document retrieving and
relevant keyword extracting operation is performed repetitively by accessing the same newspaper full text database
101, the analyzing processing of the stages (I) and (II) is performed only once and the processing of the stage (III)
is performed as many times as necessary.

[0058] The system operates schematically as described above. Next, the detailed operation of the system will be
explained successively in order of the stages (I), (II) and (III).

[0059] First, the processing in the stage (I) is explained with reference to the drawings. In the stage (I), keywords
contained in the newspaper full text database 101 are segmented and processed into three kinds of data: frequency
of occurrence of each keyword; length of each document; and weight of each keyword.

[0060] Fig. 2 shows part of an example of the contents of the newspaper full text database 101.

[0061] As shown in Fig. 2, the newspaper full text database 101 is in a text format which is editable and readable
through an electronic computer. Each newspaper article is regarded as a single document serving as a retrieval unit.
The newspaper full text database 101 stores a total of 200,000 newspaper articles in ascending order
of their document numbers. Each newspaper article can be classified according to three fields of document number,
title, and text body. The three kinds of fields are connected in this order by means of a tab character (i.e., a sort of
control character indicated as <TAB> in the drawing). One document and the next document are connected by means of
a form feed character (i.e., a sort of control character indicated as <FF> in the drawing). Document number 1 is assigned
to the head or leading (i.e., first) newspaper article. Document number 200,000 is assigned to the last (i.e., 200,000th)
newspaper article.
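
As a rough illustration of the file layout just described, the following sketch splits such a text into records. The function name and the sample strings are hypothetical; the only properties taken from the description are that fields are joined by a tab character in the order document number, title, text body, and that documents are joined by a form feed character.

```python
def parse_full_text_database(raw_text):
    """Split a full text database into (number, title, body) records.

    Documents are separated by a form feed ("\f"); the three fields of
    each document are separated by a tab ("\t").
    """
    records = []
    for chunk in raw_text.split("\f"):
        if not chunk.strip():
            continue  # tolerate a trailing form feed or an empty chunk
        # Split on the first two tabs only, so that a tab appearing
        # inside the text body does not break the record apart.
        number, title, body = chunk.split("\t", 2)
        records.append((int(number), title, body))
    return records


# Hypothetical two-article sample in the described layout.
sample = "1\tFirst title\tFirst body text\f2\tSecond title\tSecond body text"
for number, title, body in parse_full_text_database(sample):
    print(number, title, len(body))
```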

[0062] First, the newspaper full text database is entered into the keyword extracting and counting section 102.

[0063] Fig. 3 is a block diagram showing an internal arrangement of the keyword extracting and counting section
102, which is encircled by a dotted line. The keyword extracting and counting section 102 comprises a word segmenting
section 301, a word dictionary 302, a keyword selecting section 303, a stop-word dictionary 304, and a keyword
counting section 305.

[0064] First, the word segmenting section 301 reads out one document from the newspaper full text database 101
and picks up any words (morphemes) capable of becoming keyword candidates. A similar document retrieving
apparatus and a relevant keyword extracting apparatus according to the present invention do not depend on a specific
word segmenting method. Hence, various conventionally known word segmenting methods can be used in the present
invention.

[0065] For example, "Iwanami Course, Language Science 3: Word and Dictionary" by Yuji MATSUMOTO et al.,
1997, Iwanami Shoten Publishers, discloses a word segmenting technique using a morphological analysis based on
a dictionary and an adjacent cost or a statistical language model. The unexamined Japanese patent publication No.
10-69493 discloses a word segmentation using only a word dictionary according to a "maximal segmenting" method.

[0066] According to this embodiment, words are segmented by using the word dictionary 302, for example, according
to the "maximal segmenting" method disclosed in the unexamined Japanese patent publication No. 10-69493.
[0067] The keyword selecting section 303 judges with reference to the stop-word dictionary 304 whether or not each
segmented word is an extraneous or unnecessary word in performing the similar document retrieving and relevant
keyword extracting operation. When judged as not being a stop word, the segmented word is recognized as a keyword.
A keyword number is assigned to each newly recognized keyword according to the recognized order. The keyword
counting section 305 counts the frequency of appearance of each keyword appearing in one document (i.e., one
newspaper article). After processing all of the character strings contained in one document, the count result of this
document is sent to the keyword frequency-of-occurrence file 103. Then, the processing of the next document is
commenced.
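
The selection and counting steps can be sketched as below. This is a minimal sketch that assumes the word segmentation has already been done; the function name, the dictionary used to hold keyword numbers, and the sample words are hypothetical and stand in for the keyword selecting section 303, the stop-word dictionary 304 and the keyword counting section 305.

```python
from collections import Counter


def count_keywords(segmented_words, stop_words, keyword_numbers):
    """Count keyword occurrences for one document.

    segmented_words: words produced by the word segmenting step.
    stop_words:      set of words to discard.
    keyword_numbers: dict shared across documents; a newly recognized
                     keyword receives the next number in recognition order.
    Returns a Counter mapping keyword number -> frequency in the document.
    """
    counts = Counter()
    for word in segmented_words:
        if word in stop_words:
            continue  # extraneous word, not a keyword
        if word not in keyword_numbers:
            keyword_numbers[word] = len(keyword_numbers) + 1
        counts[keyword_numbers[word]] += 1
    return counts


# Hypothetical example: one shared numbering, one document.
keyword_numbers = {}
frequencies = count_keywords(
    ["IT", "of", "network", "IT"], {"of"}, keyword_numbers
)
print(keyword_numbers, dict(frequencies))
```

Sharing `keyword_numbers` across calls keeps the numbering consistent from one document to the next, which is what allows the per-document counts to be stored in a single frequency-of-occurrence file.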

[0068] Through the above-described operation, all of the documents involved in the newspaper full text database
101 are processed in order of the document number and the keyword frequency-of-occurrence file 103 is created.

[0069] Fig. 4 shows a practical example of the word segmenting processing. In Fig. 4, "word segmentation result"
shows an example of a partial result obtained through the word segmentation performed on the text body field of
document #1 according to the maximal word segmenting method with reference to the word dictionary shown at the left.
Each underlined word (i.e., character string) is a word registered in the dictionary. Each word or character string
encircled by a rectangular line is a segmented word or character string. Then, stop words are removed from the
segmentation result. The frequency of occurrence of each word is counted to obtain "keyword extraction result of
document #1."

[0070] Next, the document length calculating section 104 calculates the length of each document according to three
kinds of predetermined parameters: document length calculating mode; document length threshold (l0); and document
length root (δ). Fig. 5 is a flowchart showing a procedure for calculating the document length of a concerned document.

[0071] In step 501, it is checked whether the document length calculating mode is the "number of characters" or not.
When the document length calculating mode is the "number of characters", the control flow proceeds to step 502.
When the document length calculating mode is the "number of words", the control flow proceeds to step 503.
[0072] In step 502, i.e., when the document length calculating mode is the "number of characters", l is set to
the number of characters contained in the text body field of the concerned document, which is obtained
with reference to the newspaper full text database 101.

[0073] In step 503, i.e., when the document length calculating mode is the "number of words", l is set to
the total number of keywords (including repetitive counting of the same keyword) segmented from the
text body field of the concerned document, which is obtained with reference to the keyword frequency-of-occurrence
file 103.

[0074] In step 504, it is checked whether or not the value l calculated in step 502 or 503 is smaller than the document
length threshold l0.

[0075] If l is smaller than l0, the control flow proceeds to step 505 to replace the value l with l0.

[0076] After finishing step 505, or if l is not smaller than l0, the control flow proceeds to step 506 to further replace
the value l with a δ-th root of l. The value l thus calculated represents the document length of the concerned document
and is recorded in the document length file 105.

[0077] For example, the document length calculating mode is set to the "number of characters", the document length
threshold l0 is set to 200, and the document length root δ is 0.5. Document #1 shown in Fig. 2 contains 396 characters
in the text body field. In this case, the document length l of document #1 is calculated as 19.9 through the
above-described processing. As document #3 shown in Fig. 2 contains 302 characters in the text body field, the document
length l of document #3 is calculated as 17.38 through the above-described processing. In this manner, the
above-described processing for obtaining the document length l is performed for all of the documents in order of the
document number, thereby accomplishing the document length file 105.
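
The flow of steps 504 to 506 can be checked with a short sketch. One interpretation is assumed here and should be read as such: the "δ-th root" is implemented as raising to the power δ, because with δ = 0.5 this reproduces the two values quoted above (19.9 and 17.38). The function and argument names are hypothetical.

```python
def document_length(raw_length, threshold, root):
    """Document length per steps 504 to 506.

    raw_length: number of characters or total number of keywords.
    threshold:  document length threshold l0 (lower limit).
    root:       document length root, applied here as an exponent
                (assumed interpretation, see the lead-in).
    """
    length = max(raw_length, threshold)  # steps 504-505: apply lower limit
    return length ** root                # step 506


print(round(document_length(396, 200, 0.5), 1))  # document #1
print(round(document_length(302, 200, 0.5), 2))  # document #3
```

Any article shorter than the threshold receives the same length as an article of exactly l0 characters, so very short articles are not given disproportionate influence.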

[0078] While the document length calculating processing is performed, the keyword weight calculating section 106
calculates a weight of each keyword according to two kinds of predetermined parameters: the keyword weight
calculating mode; and the keyword weight offset (ε). Fig. 6 is a flowchart showing a procedure for calculating the
keyword weight of a specific keyword t.

[0079] In step 601, the number r of documents containing the concerned keyword t is calculated with reference to
the keyword frequency-of-occurrence file 103.

[0080] In step 602, the number r obtained in step 601 is replaced by a sum of the number r and the keyword weight
offset ε (i.e., r <- r + ε). Meanwhile, a value s is given as a sum of the number N of all documents and the keyword
weight offset ε (i.e., s <- N + ε).

[0081] In step 603, it is checked whether the keyword weight calculating mode is "1+log" or not. When the keyword
weight calculating mode is "1+log", the control flow proceeds to step 604. When the keyword weight calculating mode
is "log", the control flow proceeds to step 605.

[0082] In step 604, i.e., when the keyword weight calculating mode is "1+log", w is represented by an expression
1+log2(s/r).

[0083] In step 605, i.e., when the keyword weight calculating mode is "log", w is represented by an expression
log2((s+1)/r).

[0084] In step 606, the calculated value w is sent to the keyword weight file 107 and registered as a keyword weight
for the concerned keyword t.

[0085] The above-described processing for obtaining the keyword weight w is performed for all of the keywords in
order of the keyword number, thereby accomplishing the keyword weight file 107.

[0086] For example, the keyword weight calculating mode is set to "1+log", and the keyword weight offset ε is set
to 10. It is now assumed that the keyword for "IT technology" appears in the text body of a total of 22 newspaper
articles. In this case, the keyword weight of "IT technology" is calculated as 13.61 through the above-described
processing. It is also assumed that the keyword for "domestic" appears in the text body of a total of 2519 newspaper
articles. In this case, the keyword weight of "domestic" is calculated as 7.31 through the above-described
processing.
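
Steps 601 to 605 translate directly into a few lines of code. The sketch below uses hypothetical function and argument names and takes N = 200,000 from the database description; with the offset set to 10 it reproduces the two weights quoted in the example.

```python
import math


def keyword_weight(doc_count, total_docs, offset, mode):
    """Keyword weight per steps 601 to 605.

    doc_count:  number of documents containing the keyword.
    total_docs: number N of all documents.
    offset:     keyword weight offset, added to both counts.
    mode:       "1+log" or "log".
    """
    r = doc_count + offset   # step 602: r <- r + offset
    s = total_docs + offset  # step 602: s <- N + offset
    if mode == "1+log":
        return 1 + math.log2(s / r)    # step 604
    if mode == "log":
        return math.log2((s + 1) / r)  # step 605
    raise ValueError("mode must be '1+log' or 'log'")


print(round(keyword_weight(22, 200_000, 10, "1+log"), 2))    # "IT technology"
print(round(keyword_weight(2519, 200_000, 10, "1+log"), 2))  # "domestic"
```

The offset damps the weight of very rare keywords: without it, a keyword found in a single article would receive the largest possible weight.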

[0087] In this manner, through the processing of stage (I), the keyword frequency-of-occurrence file 103, the
document length file 105, and the keyword weight file 107 are created.

[0088] Second, the processing in the stage (II) is explained with reference to the drawings. In the stage (II), the profile
vector data of document/keyword is produced. The principal component analysis is performed for respective profile
vector data to obtain feature vectors of respective documents and keywords.

[0089] Fig. 7 is a flowchart showing a procedure for calculating the document profile vector data.

[0090] In step 701, to create a document profile vector, a concerned document number d is initialized to 1 (i.e., d <- 1).

[0091] In step 702, it is checked whether or not the concerned document number d is equal to or smaller than the
number N of all documents. When d is larger than N, the control flow proceeds to step 703 to terminate this calculation
processing. When d is equal to or smaller than N, the control flow proceeds to step 704.

[0092] In step 704, the keyword number t is initialized to 1, and the normalized factor s is initialized to 0.

[0093] In step 705, a frequency-of-occurrence f_dt of the keyword t of the document d is obtained with reference to
the keyword frequency-of-occurrence file 103. Then, the normalized factor s is replaced with a sum of s and f_dt (i.e.,
s <- s + f_dt).

[0094] In step 706, the concerned keyword number t is incremented by 1 (i.e., t <- t+1).

[0095] In step 707, it is checked whether the keyword number t is equal to or smaller than the number M of all
keywords. When t is equal to or smaller than M, the control flow returns to step 705 to process the next keyword.

[0096] Through the above steps 704 to 707, repetitive appearances of the same keyword are counted every time to
obtain the total number of keywords appearing in the document d. The obtained total is referred to as the normalized
factor s.

[0097] After the normalized factor s is obtained according to the document profile vector calculating mode in this
manner, the control flow proceeds to step 708.

[0098] In step 708, the document profile vector of the document d is calculated based on a relative
frequency-of-occurrence vector (i.e., (f_d1/s, ..., f_dM/s)) and sent to the document profile vector file 111.

[0099] In step 709, the concerned document number d is incremented by 1 (d <- d+1). Then, the control flow returns
to step 702 to process the next document.

[0100] Through the above-described processing, the document profile vector file 111 is created.

[0101] For example, when a calculation value of the normalized factor s is 92, the document profile vector of the
document #1 in the newspaper article full text database shown in Fig. 2 is obtained in the following manner with
reference to the keywords numbered in the word dictionary of Fig. 4.

(2/92, 1/92, 0, 1/92, 1/92, 0, 0, 1/92, ...)

where the first component of the document profile vector corresponds to the #1 keyword stored in the
word dictionary. Similarly, the second and third components correspond to the #2 keyword "IT" and the #3 keyword in the
word dictionary.
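
Steps 704 to 708 amount to dividing each keyword count of a document by the total count for that document. A minimal sketch follows; the function name and the short frequency list are hypothetical (the full 92-occurrence vector of document #1 is not reproduced here), and returning an all-zero vector for a document with no keywords is an added assumption that the flowchart does not address.

```python
def document_profile_vector(frequencies):
    """Relative frequency-of-occurrence vector of one document.

    frequencies[t] is the count f_dt of keyword t in the document.
    """
    s = sum(frequencies)  # normalized factor s (steps 704 to 707)
    if s == 0:
        # Assumption: a document without keywords yields a zero vector.
        return [0.0] * len(frequencies)
    return [f / s for f in frequencies]  # step 708


# Hypothetical document with four dictionary keywords.
print(document_profile_vector([2, 1, 0, 1]))
```

Because every component is divided by the same total, the components of a non-empty document always sum to 1 regardless of how long the document is.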

[0102] While the document profile vector file 111 is created, the keyword profile vector file 109 is created. Fig. 8 is a
flowchart showing a procedure for calculating the keyword profile vector data.

[0103] In step 801, to produce a keyword profile vector, a concerned keyword number t is initialized to 1 (i.e., t <- 1).

[0104] In step 802, it is checked whether or not the concerned keyword number t is equal to or smaller than the
number M of all keywords. When t is larger than M, the control flow proceeds to step 803 to terminate this calculation
processing. When t is equal to or smaller than M, the control flow proceeds to step 804.

[0105] In step 804, the document number d is initialized to 1, and the normalized factor s is initialized to 0.

[0106] In step 805, a frequency-of-occurrence f_dt of the keyword t of the document d is obtained with reference to
the keyword frequency-of-occurrence file 103. Then, the normalized factor s is replaced with a sum of s and f_dt (i.e.,
s <- s + f_dt).
[0107] In step 806, the concerned document number d is incremented by 1 (i.e., d <- d+1).

[0108] In step 807, it is checked whether the document number d is equal to or smaller than the number N of all
documents. When d is equal to or smaller than N, the control flow returns to step 805 to process the next document.

[0109] Through the above steps 804 to 807, repetitive appearances of the same document are counted every time
to obtain the total number of documents containing the concerned keyword t. The obtained total is referred to as the
normalized factor s.

[0110] After the normalized factor s is obtained according to the keyword profile vector calculating mode in this
manner, the control flow proceeds to step 808.

[0111] In step 808, the keyword profile vector of the keyword t is calculated based on a relative
frequency-of-occurrence vector (i.e., (f_1t/s, ..., f_Nt/s)) and sent to the keyword profile vector file 109.

[0112] In step 809, the concerned keyword number t is incremented by 1 (t <- t+1). Then, the control flow returns to
step 802 to process the next keyword.

[0113] Through the above-described processing, the keyword profile vector file 109 is created.
[0114] For example, when a calculation value of the normalized factor s is 283, the keyword profile vector of the #1
keyword in the newspaper article full text database shown in Fig. 2 is obtained in the following manner.

(1/283, 0, 0, 0, 0, 0, 1/283, ...)

where the first component of the keyword profile vector corresponds to a relative frequency-of-occurrence of the #1
keyword in a newspaper article of the document #1. Similarly, the second component corresponds to a
relative frequency-of-occurrence of the #1 keyword in a newspaper article of the document #2.
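
The keyword-side calculation of steps 804 to 808 normalizes along the other axis of the same frequency table. In the sketch below the table layout (one row per document, one column per keyword), the function name and the sample counts are all hypothetical, and the zero-vector fallback for a keyword that never occurs is an added assumption.

```python
def keyword_profile_vector(frequency_rows, keyword_index):
    """Relative frequency-of-occurrence vector of one keyword.

    frequency_rows[d][t] is the count f_dt of keyword t in document d.
    keyword_index selects the keyword (0-based column index).
    """
    column = [row[keyword_index] for row in frequency_rows]
    s = sum(column)  # normalized factor s (steps 804 to 807)
    if s == 0:
        # Assumption: a keyword that never occurs yields a zero vector.
        return [0.0] * len(column)
    return [f / s for f in column]  # step 808


# Hypothetical table: three documents, two keywords.
table = [
    [1, 4],
    [0, 2],
    [3, 2],
]
print(keyword_profile_vector(table, 0))
```

Here the count 1 in the first row becomes 1/4 in the keyword profile, whereas dividing by the row total would give 1/5 on the document side, so the two profiles hold different values for the same raw count.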
[0115] In this manner, the frequency-of-occurrence of the #1 keyword in the document #1 is converted into
different values and incorporated into vectors at the document side and the keyword side. This is clearly different
from the conversion of keyword frequency-of-appearance data according to the conventional LSI method. In other
words, prior to the statistical analysis, such as principal component analysis, the present invention introduces
vector representations for the document and the word which are essentially different from conventional ones.
[0116] Furthermore, each of the document profile vector and the keyword profile vector is stable and not dependent
on the document length and the keyword weight.

[0117] After creating both of the document and keyword profile vector files, the document principal component
analyzing section 114 and the keyword principal component analyzing section 112 perform the principal component
analysis on respective profile vector data with reference to the document length and the keyword weight. Through the
principal component analysis, feature vectors of K dimensions (K is predetermined as an "analysis dimension") of each
document and each keyword are obtained together with the contribution factor of each dimension.
[0118] Regarding the principal component analysis of document data, the analysis is performed based on the
following fundamental procedures.

[0119] (1) To calculate an inner product between document profile vectors P_a and P_b of two documents a and b
stored in the newspaper full text database 101, the following product-sum of weighted components is introduced.

Σ w_t * (f/h_t) * p_at * p_bt    (2)

(Σ represents a sum from t=1 to M)

where w_t represents a weight of each keyword t stored in the keyword weight file 107, h_t represents an overall
frequency-of-occurrence value of keyword t stored in the newspaper full text database 101, and f represents the sum
of frequency-of-occurrence values of all keywords in the newspaper full text database 101. A square root of h_t/f
represents a degree of dispersion (i.e., an evaluation value of standard deviation) of components p_at and p_bt of the
document profile vectors P_a and P_b. In this respect, the above formula (2) contains two factors of weight w_t and the
square root of h_t/f.

[0120] (2) The principal component analysis is executed on the assumption that the document profile vectors of the
document d of the document length l_d are contained in a document profile vector group serving as an object of the
principal component analysis by the number proportional to a ratio g_d/l_d, where g_d represents the total number of all
keywords appearing in the document d and l_d represents the document length of the document d.
[0121] The meaning of the above-described two fundamental procedures for the document data principal component
analysis is as follows. First, the procedure (1) is equivalent to obtaining an inner product by assuming that there are a
total of M coordinate axes (each corresponding to a keyword) in a space of M-dimensional document profile vector,
and the space of M-dimensional document profile vector is a distorted space in which calibrations of respective
M coordinate axes are different from each other. Furthermore, it is assumed that there are a plurality of coordinate
axes relating to each keyword t and the number of related coordinate axes is proportional to the weight w_t of the keyword.
[0122] Namely, so as to equalize a dispersion of each component corresponding to the frequency of occurrence
(= each keyword), the components p_at and p_bt of two document profile vectors P_a and P_b are respectively divided by
(h_t/f)^{1/2} and then multiplied with each other. Furthermore, by assuming that the number of products thus produced is
equal to w_t, the keyword weight is directly reflected to the inner product between the document profile vectors.
[0123] Next, according to the procedure (2), an importance of each document is regarded as g_d/l_d representing a
frequency-of-appearance density of keyword, i.e., "the total number" of keywords appearing per unit document length.
In the principal component analysis of a total of N document profile vectors, the statistical analysis processing is
performed on the assumption that the profile vectors of each document d are contained by the number corresponding to
the importance g_d/l_d. Thus, the analysis of this invention gives greater weight to the documents having higher
importance. In other words, the feature vectors resulting from this analysis indirectly reflect the importance of each
document.

[0124] Regarding the degree of dispersion of components p_at and p_bt of the document profile vectors P_a and P_b,
i.e., an expression "square root of h_t/f" representing an evaluation value of standard deviation, it is possible to derive
the degree of dispersion through approximation of the appearance probability of each keyword t in the document d by
using a Poisson distribution having an average and a dispersion of (g_d*h_t)/f, where g_d represents the total number of
keywords appearing in the document d and f represents the total number of keywords appearing in the newspaper full
text database.

[0125] Fig. 9 is a flowchart of the principal component analysis performed in the document principal component
analyzing section 114.

[0126] In step 901, a coordinate conversion is performed to deform the "distorted space" of the procedure (1) into
an ordinary space in which an inner product of vectors is obtained as a product-sum of components. The
coordinate conversion is applied to each document profile vector P_d to calculate a document profile vector X_d
resulting from this coordinate conversion according to the following formula.

X_d = f^{1/2} * W^{1/2} * H^{-1/2} * P_d    (3)

where f^{1/2} represents a square root of "the total number" f of all keywords appearing in the newspaper full text
database, W^{1/2} represents a diagonal matrix of M lines x M rows containing a square root w_t^{1/2} of the keyword weight
w_t of the keyword t as an element of t-th line and t-th row, and H^{-1/2} represents a diagonal matrix of M lines x M rows
containing an inversed square root h_t^{-1/2} of the overall frequency-of-occurrence value h_t of keyword t stored in the
newspaper full text database as an element of t-th line and t-th row. By applying this conversion, it is easily confirmed
that the inner product of the document profile vectors X_d resulting from the conversion is representable as a
product-sum of the components.
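
The statement that the conversion turns the weighted product-sum of formula (2) into an ordinary inner product can be checked numerically. The following Python sketch does so under stated assumptions: the function names and the toy values of w_t, h_t and the profile vectors are hypothetical and chosen only for illustration, and the diagonal matrices W^{1/2} and H^{-1/2} are applied component by component instead of being built explicitly.

```python
import math


def weighted_inner_product(p_a, p_b, w, h, f):
    """Formula (2): sum over t of w_t * (f / h_t) * p_at * p_bt."""
    return sum(w_t * (f / h_t) * a * b for w_t, h_t, a, b in zip(w, h, p_a, p_b))


def convert(p_d, w, h, f):
    """Formula (3): X_d = f^(1/2) * W^(1/2) * H^(-1/2) * P_d.

    W and H are diagonal, so the product acts on each component separately.
    """
    root_f = math.sqrt(f)
    return [root_f * math.sqrt(w_t) / math.sqrt(h_t) * p
            for w_t, h_t, p in zip(w, h, p_d)]


def dot(x, y):
    """Ordinary inner product as a plain product-sum of components."""
    return sum(a * b for a, b in zip(x, y))


# Hypothetical toy values for M = 3 keywords.
w = [1.0, 2.0, 0.5]      # keyword weights w_t
h = [4.0, 1.0, 5.0]      # overall frequencies h_t
f = sum(h)               # total of all keyword frequencies
p_a = [0.5, 0.5, 0.0]    # document profile vector P_a
p_b = [0.25, 0.25, 0.5]  # document profile vector P_b

lhs = weighted_inner_product(p_a, p_b, w, h, f)
rhs = dot(convert(p_a, w, h, f), convert(p_b, w, h, f))
print(math.isclose(lhs, rhs))
```

The two values agree up to floating-point rounding because each converted component carries the factor (f * w_t / h_t)^{1/2}, which appears twice in the product.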

[0127] Next, in step 902, a weighted document correlation matrix data A is calculated based on a matrix X of M lines
x N rows containing a value X_d (refer to formula 3) in the d-th row, as well as its transposed matrix X', according to the
following formula.

A = X * (G * L^{-1}) * X'    (4)

where G represents a diagonal matrix of N lines x N rows containing "the total number" g_d of keywords appearing
in the document d as an element of d-th line and d-th row, and L^{-1} represents a diagonal matrix of N lines x N rows
containing an inverse number l_d^{-1} of the document length l_d of the document d as an element of d-th line and d-th row.
[0128] Next, in step 903, the obtained matrix A is numerically decomposed into a total of K eigenvalues λ_1, λ_2, ...,
λ_K successively in order of magnitude as well as a total of K eigenvectors T_1, T_2, ..., T_K normalized so as to correspond
to the decomposed eigenvalues.

[0129] Finally, in step 904, a feature vector U_d of each document d is obtained as the following K-dimensional vector
of the converted document profile vector X_d which has components representing projections to the K eigenvectors
obtained in step 903.

U_d = (T_1·X_d, T_2·X_d, ..., T_K·X_d)    (5)

EP 1 168 202 A2

[0130] Then, considering K eigenvalues λ_1, λ_2, ..., λ_K as "contribution factors", a total of N vectors of K dimensions
are obtained as the "feature vector" of each document and stored in the document principal component analysis
result file 115.
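
Steps 902 to 904 can be sketched in Python as below. Several points are assumptions of this sketch and not taken from the source: the matrix X is held as a list of its columns X_d, all function names are hypothetical, and the eigenvalues are found by power iteration with deflation, which is a simple stand-in because the source only states that A is "numerically decomposed". The toy data are chosen so that A is diagonal and the result is easy to verify by hand.

```python
import math


def correlation_matrix(x_cols, g, l):
    """Formula (4): A = X * (G * L^-1) * X', with X given as a list of columns X_d."""
    m = len(x_cols[0])
    a = [[0.0] * m for _ in range(m)]
    for x_d, g_d, l_d in zip(x_cols, g, l):
        weight = g_d / l_d  # importance g_d / l_d of document d
        for i in range(m):
            for j in range(m):
                a[i][j] += weight * x_d[i] * x_d[j]
    return a


def top_eigenpairs(a, k, iterations=500):
    """Return the k largest eigenvalues with unit eigenvectors of symmetric matrix a.

    Power iteration with deflation (a stand-in chosen for this sketch).
    """
    m = len(a)
    a = [row[:] for row in a]  # work on a copy
    pairs = []
    for index in range(k):
        # Deterministic start vector with unequal components, scaled to unit length.
        v = [1.0 / (i + 1 + index) for i in range(m)]
        scale = math.sqrt(sum(c * c for c in v))
        v = [c / scale for c in v]
        for _ in range(iterations):
            nxt = [sum(a[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(c * c for c in nxt))
            if norm == 0.0:
                break
            v = [c / norm for c in nxt]
        # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
        lam = sum(v[i] * sum(a[i][j] * v[j] for j in range(m)) for i in range(m))
        pairs.append((lam, v))
        # Deflation: subtract the component just found before the next search.
        for i in range(m):
            for j in range(m):
                a[i][j] -= lam * v[i] * v[j]
    return pairs


def feature_vector(x_d, eigenvectors):
    """Formula (5): U_d = (T_1 . X_d, ..., T_K . X_d)."""
    return [sum(t_i * x_i for t_i, x_i in zip(t_k, x_d)) for t_k in eigenvectors]


# Hypothetical toy data: M = 2 keywords, N = 3 documents.
x_cols = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]  # converted profile vectors X_d
g = [4.0, 3.0, 4.0]                             # keyword totals g_d
l = [2.0, 3.0, 4.0]                             # document lengths l_d

a = correlation_matrix(x_cols, g, l)            # equals [[12, 0], [0, 1]] here
pairs = top_eigenpairs(a, k=2)
eigenvalues = [lam for lam, _ in pairs]
features = [feature_vector(x_d, [v for _, v in pairs]) for x_d in x_cols]
print(eigenvalues)
print(features)
```

Documents 1 and 3 have the same converted vector but different importance g_d/l_d (2 and 1), so document 1 contributes twice as much to A. This is the "number of vectors" weighting of procedure (2).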

[0131] Regarding the principal component analysis of keyword, the analysis is performed based on the following
fundamental procedures.

[0132] (1) To calculate an inner product between keyword profile vectors Q_a and Q_b of two keywords K_a and K_b
appearing in the newspaper full text database 101, the following product-sum of weighted components is introduced.

Σ f/(l_d*g_d) * q_ad * q_bd    (6)

(Σ represents a sum from d=1 to N)

where l_d represents the document length of document d stored in the document length file 105, g_d represents the
total number of all keywords appearing in the document d, and f represents the sum of frequency-of-occurrence values
of all keywords in the newspaper full text database 101. A square root of g_d/f represents a degree of dispersion (i.e.,
an evaluation value of standard deviation) of components q_ad and q_bd of the keyword profile vectors Q_a and Q_b. In this
respect, the above formula (6) contains two factors of document length l_d and the square root of g_d/f.
[0133] (2) The principal component analysis is executed on the assumption that the keyword profile vectors of the
keyword t of the keyword weight w_t are contained in a keyword profile vector group serving as an object of the principal
component analysis by the number proportional to a product h_t*w_t, where h_t represents an overall frequency-of-occurrence
value of keyword t stored in the newspaper full text database 101 and w_t represents the keyword weight of the keyword t.
[0134] The meaning of the above-described two fundamental procedures for the keyword principal component analysis
is as follows. First, the procedure (1) is equivalent to obtaining an inner product by assuming that there are a total of
N coordinate axes (each corresponding to a document) in a space of N-dimensional keyword profile vector, and the
space of N-dimensional keyword profile vector is a distorted space in which calibrations of respective N
coordinate axes are different from each other. Furthermore, it is assumed that there are a plurality of coordinate axes
relating to each document d and the number of related coordinate axes is inversely proportional to the document length l_d.
[0135] Namely, so as to equalize a dispersion of each component corresponding to the frequency of occurrence
(= relative frequency-of-occurrence in each document), the components q_ad and q_bd of two keyword profile vectors Q_a
and Q_b are respectively divided by (g_d/f)^{1/2} and then multiplied with each other. Furthermore, by assuming that the
number of products thus produced is equal to l_d^{-1}, the document length is directly reflected to the inner product between
the keyword profile vectors.

[0136] Next, according to the procedure (2), an importance of each keyword is regarded as h_t*w_t. In the principal
component analysis of a total of M keyword profile vectors, the statistical analysis processing is performed on the
assumption that the profile vectors of each keyword t are contained by the number corresponding to the importance
h_t*w_t. Thus, the analysis of this invention gives greater weight to the keywords having higher importance. In
other words, the feature vectors resulting from this analysis indirectly reflect the importance of each keyword.
[0137] Regarding the degree of dispersion of components q_ad and q_bd of the keyword profile vectors Q_a and Q_b,
i.e., an expression "square root of g_d/f" representing an evaluation value of standard deviation, it is possible to derive
the degree of dispersion through approximation of the appearance probability of each keyword t in the document d by
using a Poisson distribution having an average and a dispersion of (g_d*h_t)/f, where g_d represents the total number of
keywords appearing in the document d and f represents the total number of keywords appearing in the newspaper full
text database.

[0138] The keyword analysis processing of this invention can be performed independently without affecting
the document analysis processing, and is therefore different from the conventional LSI method.
[0139] Fig. 10 is a flowchart of the principal component analysis performed in the keyword principal component 
analyzing section 112. 

[0140] In step 1001, a coordinate conversion is performed to deform the "distorted space" of the procedure (1) into
an ordinary space in which an inner product of vectors is obtained as a product-sum of components. The
coordinate conversion is applied to each keyword profile vector Q_t to calculate a keyword profile vector Y_t resulting from
this coordinate conversion according to the following formula.

Y_t = f^{1/2} * L^{-1/2} * G^{-1/2} * Q_t    (7)

where f^{1/2} represents a square root of "the total number" f of all keywords appearing in the newspaper full text
database, L^{-1/2} represents a diagonal matrix of N lines x N rows containing an inversed square root l_d^{-1/2} of the document
length l_d of the document d as an element of d-th line and d-th row, and G^{-1/2} represents a diagonal matrix of N lines
x N rows containing an inversed square root g_d^{-1/2} of the total number g_d of keywords appearing in the document d as
an element of d-th line and d-th row. By applying this conversion, it is easily confirmed that the inner product of the
keyword profile vectors Y_t resulting from the conversion is representable as a product-sum of the components.
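
The confirmation mentioned at the end of this paragraph can be written out in one step. The following derivation is a sketch that assumes, consistently with formulas (6) and (7), that L and G are diagonal with elements l_d and g_d, and that Y_a and Y_b are the converted vectors of two keywords K_a and K_b:

```latex
\begin{aligned}
Y_a \cdot Y_b
  &= \sum_{d=1}^{N}
     \left( f^{1/2}\, l_d^{-1/2}\, g_d^{-1/2}\, q_{ad} \right)
     \left( f^{1/2}\, l_d^{-1/2}\, g_d^{-1/2}\, q_{bd} \right) \\
  &= \sum_{d=1}^{N} \frac{f}{l_d\, g_d}\, q_{ad}\, q_{bd}
\end{aligned}
```

The last line is the weighted product-sum of formula (6), so the ordinary inner product in the converted space already contains both the document length and the dispersion factor.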
[0141] Next, in step 1002, a weighted keyword correlation matrix data B is calculated based on a matrix Y of N lines
x M rows containing a value Y_t (refer to formula 7) in the t-th row as well as its transposed matrix Y' according to the
following formula.

B = Y * (H * W) * Y'    (8)

where H represents a diagonal matrix of M lines x M rows containing the overall frequency-of-appearance value
h_t of the keyword t as an element of t-th line and t-th row, and W represents a diagonal matrix of M lines x M rows
containing the weight w_t of the keyword t as an element of t-th line and t-th row.
[0142] Next, in step 1003, the obtained matrix B is numerically decomposed into a total of K eigenvalues θ_1, θ_2, ...,
θ_K successively in order of magnitude as well as a total of K eigenvectors Z_1, Z_2, ..., Z_K normalized so as to correspond
to the decomposed eigenvalues.

[0143] Finally, in step 1004, a feature vector V_t of each keyword t is obtained as the following K-dimensional vector
of the converted keyword profile vector Y_t which has components representing projections to the K eigenvectors
obtained in step 1003.

V_t = (Z_1·Y_t, Z_2·Y_t, ..., Z_K·Y_t)    (9)

[0144] Then, considering K eigenvalues θ_1, θ_2, ..., θ_K as "contribution factors", a total of M vectors of K dimensions
are obtained as the "feature vector" of each keyword and stored in the keyword principal component analysis result
file 113.

[0145] As described above, after accomplishing the processing of stage (II), the keyword principal component
analysis result file 113 and the document principal component analysis result file 115 are created via the keyword profile
vector file 109 and the document profile vector file 111. At this moment, all of the preparations necessary for receiving
retrieval/extraction conditions are completed.

[0146] After this moment, when the conditions for the similar document retrieval and relevant keyword extraction
(either a string of document numbers or a string of keywords) are entered through the condition input section 116, the
processing of stage (III) is initiated for the similar document retrieval and relevant keyword extraction.
[0147] First, the similar document retrieval processing will be explained with reference to the drawings. Fig. 11 is a 
flowchart showing the procedure for calculating the retrieval condition feature vector performed in the retrieval condition 
feature vector calculating section 117. 

[0148] In step 1101, it is checked whether or not a character string entered through the condition input section 116
is a string of document numbers. When the entered character string is a string of document numbers, the control flow
proceeds to step 1102. Otherwise, the entered character string is regarded as a string of keywords and the control flow
proceeds to step 1103. More specifically, if an entered character string consists of numeric characters including at least
one of digits "0" to "9" joined by commas, the entered character string is judged as a string of document numbers.
[0149] In step 1102, when the entered character string consists of document numbers, a feature vector U_d of the
document d contained in the entered character string is obtained with reference to the document principal component
analysis result file 115. Then, an average vector R of the obtained feature vectors U_d is calculated. The average vector
R is obtained by multiplying a sum of document feature vectors with an inverse of the number r of entered documents.
[0150] In step 1103, when the entered character string consists of a total of r keywords, an M-dimensional vector E
is created so that a component corresponding to a keyword number of each entered keyword is 1/r and the remaining
components are 0.

[0151] In step 1104, a K-dimensional vector R is calculated with reference to the keyword principal component
analysis result file 113 and the keyword weight file 107.

R = Θ^{-1} * V * W * E    (10)

where Θ^{-1} represents a diagonal matrix of K lines x K rows containing an inversed contribution factor θ_j^{-1} of each
dimension of the keyword feature vector, V represents a keyword feature matrix of K lines x M rows containing a
keyword feature vector V_t of the keyword t in the t-th row, and W represents a diagonal matrix of M lines x M rows
containing the weight w_t of each keyword as an element of t-th line and t-th row.

[0152] In step 1105, the K-dimensional vector R created in step 1102 or steps 1103 to 1104 is recognized as a retrieval
condition feature vector and sent to the first similar document retrieving section 119 and to the second similar document
retrieving section 120.

[0153] In obtaining the vector R in step 1104, it is only necessary to obtain the keyword weight w_t corresponding
to each component of E which is not 0 and the keyword feature vector V_t from the keyword weight file 107 and the
keyword principal component analysis result file 113. When the total number r of input keywords is on the order of several
tens or less, calculation of the vector R can be performed speedily.
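
Steps 1101 to 1105 can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the source: formula (10) is taken to have the form R = Θ^{-1} * V * W * E, as suggested by the symbol definitions given for it; keywords in the input string are assumed to be separated by commas; and the function names, the dictionaries and the keywords "alpha" and "beta" are hypothetical stand-ins for the files 107, 113 and 115.

```python
import re


def is_document_number_string(text):
    """Step 1101: digits "0" to "9" joined by commas mean document numbers."""
    return re.fullmatch(r"\d+(,\d+)*", text) is not None


def average_vector(vectors):
    """Step 1102: average of the feature vectors U_d of the entered documents."""
    r = len(vectors)
    return [sum(column) / r for column in zip(*vectors)]


def keyword_condition_vector(keyword_numbers, v, w, theta):
    """Steps 1103 and 1104, assuming R = Theta^-1 * V * W * E.

    E has 1/r at each entered keyword and 0 elsewhere, so only the entered
    keywords are visited. v[t] is the feature vector V_t, w[t] the weight w_t
    and theta[k] the contribution factor of dimension k.
    """
    r = len(keyword_numbers)
    k_dims = len(theta)
    total = [0.0] * k_dims
    for t in keyword_numbers:
        for k in range(k_dims):
            total[k] += v[t][k] * w[t] / r
    return [total[k] / theta[k] for k in range(k_dims)]


def retrieval_condition_vector(text, doc_features, keyword_ids, v, w, theta):
    """Steps 1101 to 1105 for one entered character string."""
    if is_document_number_string(text):
        numbers = [int(part) for part in text.split(",")]
        return average_vector([doc_features[d] for d in numbers])
    # Comma-separated keywords are an assumption of this sketch.
    keywords = [keyword_ids[word] for word in text.split(",")]
    return keyword_condition_vector(keywords, v, w, theta)


# Hypothetical toy data with K = 2 dimensions.
doc_features = {1: [1.0, 0.0], 2: [3.0, 2.0]}  # U_d keyed by document number
keyword_ids = {"alpha": 0, "beta": 1}          # keyword string -> keyword number
v = [[2.0, 4.0], [6.0, 8.0]]                    # V_t for each keyword number
w = [1.0, 0.5]                                  # keyword weights w_t
theta = [2.0, 4.0]                              # contribution factors

r_docs = retrieval_condition_vector("1,2", doc_features, keyword_ids, v, w, theta)
r_words = retrieval_condition_vector("alpha,beta", doc_features, keyword_ids, v, w, theta)
print(r_docs)   # average of U_1 and U_2
print(r_words)
```

Only the r entered keywords are touched in the keyword branch, which reflects the remark that the calculation is fast when r is several tens or less.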
[0154] After the retrieval condition feature vector calculating section 117 has obtained the retrieval condition feature
vector R, the first similar document retrieving section 119 calculates an inner product between the document feature
vector U_d stored in the document principal component analysis result file 115 and the retrieval condition feature vector
R to select 1st to α-th largest documents in the calculated inner product value, where α is a predetermined parameter
representing a "displayed similar document number." Then, a total of α sets, each set being a combination of a document
number d and an inner product between U_d and R, are sent to the result display section 123.
[0155] At the same time, the second similar document retrieving section 120 calculates a distance between the
document feature vector U_d stored in the document principal component analysis result file 115 and the retrieval
condition feature vector R to select 1st to α-th smallest documents in the calculated distance value. Then, a total of α sets,
each set being a combination of a document number d and a distance between U_d and R, are sent to the result display
section 123.

[0156] Regarding the method for efficiently selecting the 1st to α-th largest inner product values or the 1st to α-th
smallest distance values with respect to the vector R from numerous vectors, the "vector index building method and
similar vector retrieval method" disclosed in unexamined Japanese patent publication 11-363058 and other conventionally
known vector retrieval methods can be used to efficiently obtain a total of α similar documents. Details of such
high-speed retrieval methods of similar vectors have no influence on the gist of the present invention and are therefore
not explained here.
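
For small collections, the two selections performed by the retrieving sections 119 and 120 can be sketched as a plain linear scan. This is not the indexed method of the cited publication; it is a minimal Python illustration in which the function names and toy vectors are hypothetical, and the Euclidean metric is an assumption because the source does not name the distance used.

```python
import heapq
import math


def inner_product(u, r):
    """Plain product-sum of components of two feature vectors."""
    return sum(a * b for a, b in zip(u, r))


def distance(u, r):
    """Euclidean distance (an assumption; the source does not name the metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, r)))


def most_similar_by_inner_product(doc_features, r, alpha):
    """Section 119: the alpha documents with the largest inner product with R."""
    scored = ((inner_product(u, r), d) for d, u in doc_features.items())
    return [(d, score) for score, d in heapq.nlargest(alpha, scored)]


def most_similar_by_distance(doc_features, r, alpha):
    """Section 120: the alpha documents with the smallest distance to R."""
    scored = ((distance(u, r), d) for d, u in doc_features.items())
    return [(d, dist) for dist, d in heapq.nsmallest(alpha, scored)]


# Hypothetical feature vectors U_d keyed by document number, K = 2.
doc_features = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [3.0, 4.0]}
r = [1.0, 0.0]  # retrieval condition feature vector R

by_product = most_similar_by_inner_product(doc_features, r, alpha=2)
by_distance = most_similar_by_distance(doc_features, r, alpha=2)
print(by_product)   # documents 3 and 1, in that order
print(by_distance)  # documents 1 and 2, in that order
```

The two measures can rank documents differently, as the toy data show: the long vector of document 3 wins on inner product but is far away in distance. This is why the display section later chooses between the two results by a mode parameter.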

[0157] Next, the relevant keyword extraction processing will be explained with reference to the drawings. Fig. 12 is
a flowchart showing the procedure for calculating the extracting condition feature vector performed in the extracting
condition feature vector calculating section 118.
[0158] In step 1201, it is checked whether or not a character string entered through the condition input section 116
is a string of keywords. When the entered character string is a string of keywords, the control flow proceeds to step
1202. Otherwise, the entered character string is regarded as a string of document numbers and the control flow
proceeds to step 1203. More specifically, if an entered character string consists of numeric characters including at least
one of digits "0" to "9" joined by commas, the entered character string is judged as a string of document numbers.
Otherwise, the entered character string is regarded as a string of keywords.

[0159] In step 1202, when the entered character string consists of keywords, a feature vector V_t of the keyword t
contained in the entered character string is obtained with reference to the keyword principal component analysis result
file 113. Then, an average vector R of the obtained feature vectors V_t is calculated. The average vector R is obtained
by multiplying a sum of keyword feature vectors with an inverse of the number r of entered keywords.

[0160] In step 1203, when the entered character string consists of a total of r document numbers, an N-dimensional
vector E is created so that a component corresponding to a document number of each entered document is 1/r and
the remaining components are 0.

[0161] In step 1204, a K-dimensional vector R is calculated with reference to the document principal component
analysis result file 115 and the document length file 105.

R = Λ^{-1} * U * L^{-1} * E    (11)

where Λ^{-1} represents a diagonal matrix of K lines x K rows containing an inversed contribution factor λ_j^{-1} of each
dimension of the document feature vector, U represents a document feature matrix of K lines x N rows containing a
document feature vector U_d of the document number d in the d-th row, and L^{-1} represents the inverse of a diagonal
matrix L of N lines x N rows containing the document length l_d of each document d as an element of d-th line and d-th row.

[0162] In step 1205, the K-dimensional vector R created in step 1202 or steps 1203 to 1204 is recognized as an
extracting condition feature vector and sent to the first relevant keyword extracting section 121 and to the second
relevant keyword extracting section 122.

[0163] In obtaining the vector R in step 1204, it is only necessary to obtain the document length l_d corresponding
to each component of E which is not 0 and the document feature vector U_d from the document length file 105 and the
document principal component analysis result file 115. When the total number r of input document numbers is on the
order of several tens or less, calculation of the vector R can be performed speedily.
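
The reason only a few file entries are needed follows from the sparsity of E. The reduction below is a sketch under stated assumptions: formula (11) is read as R = Λ^{-1} U L^{-1} E with L^{-1} holding the inverse document lengths, the entered document numbers are distinct, S denotes the set of the r entered document numbers (a symbol introduced here only for illustration), and R_k and U_{d,k} denote the k-th components of R and U_d.

```latex
R_k \;=\; \frac{1}{\lambda_k} \sum_{d=1}^{N} U_{d,k}\, l_d^{-1}\, E_d
    \;=\; \frac{1}{r\,\lambda_k} \sum_{d \in S} \frac{U_{d,k}}{l_d},
\qquad k = 1, \dots, K .
```

The second equality holds because E_d is 1/r for the entered documents and 0 otherwise, so the sum over all N documents collapses to r terms.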

[0164] After the extracting condition feature vector calculating section 118 has obtained the extracting condition
feature vector R, the first relevant keyword extracting section 121 calculates an inner product between the keyword
feature vector V_t stored in the keyword principal component analysis result file 113 and the extracting condition feature
vector R to select 1st to β-th largest keywords in the calculated inner product value, where β is a predetermined
parameter representing a "displayed relevant keyword number." Then, a total of β sets, each set being a combination of
a keyword t and an inner product between V_t and R, are sent to the result display section 123.
[0165] At the same time, the second relevant keyword extracting section 122 calculates a distance between the
keyword feature vector V_t stored in the keyword principal component analysis result file 113 and the extracting condition
feature vector R to select 1st to β-th smallest keywords in the calculated distance value. Then, a total of β sets, each
set being a combination of a keyword t and a distance between V_t and R, are sent to the result display section 123.
[0166] Regarding the method for efficiently selecting the 1st to β-th largest inner product values or the 1st to β-th
smallest distance values with respect to the vector R from numerous vectors, the "vector index building method and
similar vector retrieval method" disclosed in unexamined Japanese patent publication 11-363058 and other conventionally
known vector retrieval methods can be used to efficiently obtain a total of β relevant keywords. Details of such
high-speed retrieval methods of similar vectors have no influence on the gist of the present invention and are therefore
not explained here.

[0167] After both of α similar documents and β relevant keywords are obtained in this manner, the result display
section 123 selects either the result based on the inner product or the result based on the distance according to the
setting values of two kinds of parameters: i.e., the document similarity calculating mode ("inner product" or "distance")
and the keyword relevancy calculating mode ("inner product" or "distance"). Then, the result display section 123
displays character strings representing β keywords together with similarity values of β keywords as a relevant keyword
extraction result. Furthermore, the result display section 123 obtains the titles of the α similar documents based on
their document numbers from the newspaper full text database 101. Then, the result display section 123 displays three
items, i.e., "document number", "title", and "similarity", identifying each of the α similar documents thus extracted.
[0168] In this manner, the processing for the stage (III) is accomplished, thereby terminating the similar document
retrieval and relevant keyword extracting processing responsive to an arbitrary input.

[0169] The similar document retrieving and relevant keyword extracting system shown in Fig. 1 operates as described
above.

[0170] As described above, the similar document retrieving apparatus and the relevant keyword extracting apparatus
according to a preferred embodiment of the present invention express the frequency-of-appearance of each keyword
contained in a concerned document as a document profile vector and also express the frequency-of-appearance of
a concerned keyword in each document as a keyword profile vector. The document length data, the keyword weight
data, and the component dispersion (i.e., standard deviation) are independently reflected to each profile as a weight
(i.e., the number of components) in the calculation of the inner product between vectors (as a similarity measure) or
as a weight (i.e., the number of vectors) in the principal component analysis.

[0171] In this case, the vector representation in the document profile and in the keyword profile is not dependent on
the conversion (i.e., normalization) of frequency-of-occurrence. The document length data, the keyword weight data,
and the component dispersion are relevant to the conversion of frequency-of-occurrence. As described above, the
document length data, the keyword weight data, and the component dispersion are indirectly reflected as the weight
in the calculation of the inner product between vectors or as the weight in the principal component analysis. Thus, it
becomes possible to normalize the feature vector of each document and each keyword without depending on the
conversion of frequency-of-occurrence.

[0172] Accordingly, the present invention solves the fundamental problems caused in an apparatus employing a
conventional LSI method which is used in a statistical analysis on the single matrix data directly converted from the
keyword frequency-of-occurrence data F. More specifically, the present invention solves the problem of asymmetry
caused in the conversion of keyword frequency-of-occurrence, as well as the problem of instability caused by the
merging of documents or keywords with respect to the document similarity or the keyword relevancy. As a result, the
present invention makes it possible to provide the similar document retrieving apparatus and the relevant keyword
extracting apparatus which are highly accurate.

[0173] The above-described embodiment depends on the specific methods to perform the word segmentation, the
document length calculation, and the keyword weight calculation. However, the gist of the present invention is not
dependent on the disclosed specific methods. Thus, various methods can be flexibly employed considering the type
of document database, and conditions or purposes for the retrieval and extracting operation, or the like.
[0174] Even in such a case, the present invention differs from the conventional LSI method in that no adverse effects
will be given to the principal component analysis result, as well as to the similar document retrieval result and the
relevant keyword extraction result. In this respect, so-called moderate effect will be reflected to the analysis result and
retrieval/extraction result.

[0175] As a result, the present invention makes it possible to provide a method for segmenting words, extracting
keywords, calculating a document length, and calculating a keyword weight according to the type of document database
as well as the conditions or purposes for the retrieval and extracting operation without causing adverse effects. Thus,
a very reliable and robust system is constructed.

[0176] Furthermore, in the foregoing description, the above-described embodiment was explained as a system
performing both of the similar document retrieval and the relevant keyword extraction based on both of the similarity by
the inner product and the similarity by the distance. However, it is needless to say that, if any part of the above-described
system is functionally unnecessary, such sections or files can be omitted to construct a subset of the system of Fig. 1.
[0177] As apparent from the foregoing description, the similar document retrieving apparatus and the relevant
keyword extracting apparatus according to the present invention can overcome the problems encountered in the prior art
and can realize a highly reliable similar document retrieval and relevant keyword extracting operation.
[0178] Especially, when the present invention is applied to a large-scale document database, it becomes possible,
without being influenced by adverse effects, to provide a method for segmenting words, extracting keywords,
calculating a document length, and calculating a keyword weight according to the type of document database as well as the
conditions or purposes for the retrieval and extracting operation. Thus, it becomes possible to construct a very robust
and highly accurate system.



Claims

1. A similar document retrieving apparatus applicable to a document database D (101) which stores N document
data containing a total of M kinds of keywords and is machine processible, for designating a retrieval condition
consisting of a document group including at least one document x_1, ..., x_r selected from said document database
D and for retrieving documents similar to said document group of said retrieval condition from said document
database D,

characterized by 

keyword frequency-of- occurrence calculating means (102) for calculating a keyword frequency-of -occurrence 
35 data F which represents a f req uency-of-occurrence f^ of each keyword t appearing In each document d stored 

in said document database D; 

document length calculating means (104) for calculating a document length data L which represents a length 
1^ of said each document d; 

keyword weight calculating means (1 06) for cafcu lating a keyword weight data W which represents a weight 
40 W{ of each keywon:! t of said M kinds of keywords appearing In said document database O; 

document profile vector producing means (110) for producing a M-dimensional document profile vector 
having components respectively representing a relative frequency-of-occurrence p^ of each keyword t in the 
concerned document d; 

document principal component analyzing means (114) for performing a principal component analysis, on a 
4^ document profile vector group of a document group In said document database D and for obtaining a predefined 

(K)-dimensional document feature vector corresponding to said docun^ent profile vector P^for said each 
document d; and 

similar document retrieving means (119, 1 20) for receiving said retrieval condition consisting of the document 
group including at least one document x^ , — , x^ selected from said document database D, calculating a sim- 
so ilarity between each document d and said retrieval condition based on a document feature vector of said 

received document group and the document feature vector of each document d in said document database 
D, and outputting a designated number of similar documents in order of the calculated similarity. 
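
As an informal illustration of the principal component analysis step recited above, the following minimal sketch reduces M-dimensional document profile vectors to K-dimensional feature vectors. It is a plain, unweighted PCA computed through a singular value decomposition of the centered profile matrix; it does not include the document length and keyword weight weighting of the later dependent claims, and the function name `pca_features` and the covariance-style scaling of the eigenvalues are assumptions made for this example only.

```python
import numpy as np

def pca_features(P, K):
    """Project profile vectors (rows of P) onto the first K principal axes.

    Assumption: plain PCA on centered data, not the weighted variant
    described in the dependent claims.
    """
    P = np.asarray(P, dtype=float)
    X = P - P.mean(axis=0)                       # center each keyword column
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    U = X @ Vt[:K].T                             # K-dimensional feature vectors
    eig = (s[:K] ** 2) / len(P)                  # variance captured per axis
    return U, eig

# Four documents over two keywords, reduced to one dimension.
U, eig = pca_features([[1, 0], [0, 1], [1, 0], [0, 1]], K=1)
```

The sign of each principal axis is arbitrary, so only the relative positions of the feature vectors are meaningful.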

2. The similar document retrieving apparatus in accordance with claim 1, wherein said similar document retrieving
means (119) calculates the similarity between each document d and said retrieval condition based on an inner
product between the document feature vector of said received document group and said document feature vector
of each document d.

3. The similar document retrieving apparatus in accordance with claim 1, wherein said similar document retrieving
means (120) calculates the similarity between each document d and said retrieval condition based on a distance
between the document feature vector of said received document group and said document feature vector of each
document d.
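
The two similarity measures named above (inner product and distance) can rank the same documents differently. The sketch below illustrates this under stated assumptions: the feature vector of the received document group is taken as the mean of its members' feature vectors, the distance is Euclidean, and the name `rank_similar` is hypothetical. None of these choices is fixed by the claims.

```python
import numpy as np

def rank_similar(doc_features, query_ids, top_n=3, mode="inner"):
    """Return indices of the top_n documents most similar to the query group.

    doc_features: (N, K) array whose row d is the feature vector of document d.
    Assumption: the query feature vector is the mean over the query documents.
    """
    U = np.asarray(doc_features, dtype=float)
    q = U[list(query_ids)].mean(axis=0)
    if mode == "inner":
        score = U @ q                            # larger product = more similar
    elif mode == "distance":
        score = -np.linalg.norm(U - q, axis=1)   # smaller distance = more similar
    else:
        raise ValueError("mode must be 'inner' or 'distance'")
    order = np.argsort(-score, kind="stable")[:top_n]
    return [int(d) for d in order]

features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [3.0, 0.0]]
by_inner = rank_similar(features, [0], mode="inner")        # favors long vectors
by_distance = rank_similar(features, [0], mode="distance")  # favors nearby vectors
```

In this toy data the inner product ranks document 3 first because its feature vector is long, while the distance ranks the query document itself and its near neighbor first.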

4. A similar document retrieving apparatus applicable to a document database D (101) which stores N document
data containing a total of M kinds of keywords and is machine processible, for designating a retrieval condition
consisting of a keyword group including at least one keyword y_1, ..., y_s selected from said document database D
and for retrieving documents relevant to said retrieval condition from said document database D,

characterized by

keyword frequency-of-occurrence calculating means (102) for calculating a keyword frequency-of-occurrence
data F which represents a frequency-of-occurrence f_dt of each keyword t appearing in each document d stored
in said document database D;

document length calculating means (104) for calculating a document length data L which represents a length
l_d of said each document d;

keyword weight calculating means (106) for calculating a keyword weight data W which represents a weight
w_t of each keyword t of said M kinds of keywords appearing in said document database D;

document profile vector producing means (110) for producing an M-dimensional document profile vector P_d
having components respectively representing a relative frequency-of-occurrence p_dt of each keyword t in the
concerned document d;

keyword profile vector producing means (108) for producing an N-dimensional keyword profile vector Q_t having
components respectively representing a relative frequency-of-occurrence q_td of the concerned keyword t in
each document d;

document principal component analyzing means (114) for performing a principal component analysis on a
document profile vector group of a document group in said document database D and for obtaining a predefined
(K)-dimensional document feature vector U_d corresponding to said document profile vector P_d for said each
document d;

keyword principal component analyzing means (112) for performing a principal component analysis on a
keyword profile vector group of a keyword group in said document database D and for obtaining a predefined
(K)-dimensional keyword feature vector V_t corresponding to said keyword profile vector Q_t for said each keyword
t, said keyword feature vector having the same dimension as that of said document feature vector, as well as
for obtaining a keyword contribution factor (i.e., eigenvalue of a correlation matrix) θ_j of each dimension j;

retrieval condition feature vector calculating means (117) for receiving said retrieval condition consisting of
the keyword group including at least one keyword y_1, ..., y_s, and for calculating a retrieval condition feature
vector corresponding to said retrieval condition based on said keyword weight data of the received keyword group,
said keyword feature vector and said keyword contribution factor; and

similar document retrieving means (119, 120) for calculating a similarity between each document d and said
retrieval condition based on the calculated retrieval condition feature vector and a document feature vector of
said each document d, and outputting a designated number of similar documents in order of the calculated
similarity.

5. The similar document retrieving apparatus in accordance with claim 4, wherein said similar document retrieving
means (119) calculates the similarity between each document d and said retrieval condition based on an inner
product between said retrieval condition feature vector and said document feature vector of each document d.

6. The similar document retrieving apparatus in accordance with claim 4, wherein said similar document retrieving
means (120) calculates the similarity between each document d and said retrieval condition based on a distance
between the retrieval condition feature vector and said document feature vector of each document d.


7. A relevant keyword extracting apparatus applicable to a document database D (101) which stores N document
data containing a total of M kinds of keywords and is machine processible, for designating an extracting condition
consisting of a keyword group including at least one keyword y_1, ..., y_s selected from said document database D
and for extracting keywords relevant to said keyword group of said extracting condition from said document
database D,

characterized by

keyword frequency-of-occurrence calculating means (102) for calculating a keyword frequency-of-occurrence
data F which represents a frequency-of-occurrence f_dt of each keyword t appearing in each document d stored
in said document database D;

document length calculating means (104) for calculating a document length data L which represents a length
l_d of said each document d;

keyword weight calculating means (106) for calculating a keyword weight data W which represents a weight
w_t of each keyword t of said M kinds of keywords appearing in said document database D;

keyword profile vector producing means (108) for producing an N-dimensional keyword profile vector Q_t having
components respectively representing a relative frequency-of-occurrence q_td of the concerned keyword t in
each document d;

keyword principal component analyzing means (112) for performing a principal component analysis on a
keyword profile vector group of a keyword group in said document database D and for obtaining a predefined
(K)-dimensional keyword feature vector V_t corresponding to said keyword profile vector Q_t for said each keyword
t; and

relevant keyword extracting means (121, 122) for receiving said extracting condition consisting of the keyword
group including at least one keyword y_1, ..., y_s selected from said document database D, calculating a
relevancy between each keyword t and said extracting condition based on a keyword feature vector of said received
keyword group and the keyword feature vector of each keyword t in said document database D, and outputting
a designated number of relevant keywords in order of the calculated relevancy.

8. The relevant keyword extracting apparatus in accordance with claim 7, wherein said relevant keyword extracting
means (121) calculates the relevancy between each keyword t and said extracting condition based on an inner
product between the keyword feature vector of said received keyword group and said keyword feature vector of
each keyword t.

9. The relevant keyword extracting apparatus in accordance with claim 7, wherein said relevant keyword extracting
means (122) calculates the relevancy between each keyword t and said extracting condition based on a distance
between the keyword feature vector of said received keyword group and said keyword feature vector of each
keyword t.

10. A relevant keyword extracting apparatus applicable to a document database D (101) which stores N document
data containing a total of M kinds of keywords and is machine processible, for designating an extracting condition
consisting of a document group including at least one document x_1, ..., x_r selected from said document database
D and for extracting keywords relevant to the document group of said extracting condition from said document
database D,

characterized by

keyword frequency-of-occurrence calculating means (102) for calculating a keyword frequency-of-occurrence
data F which represents a frequency-of-occurrence f_dt of each keyword t appearing in each document d stored
in said document database D;

document length calculating means (104) for calculating a document length data L which represents a length
l_d of said each document d;

keyword weight calculating means (106) for calculating a keyword weight data W which represents a weight
w_t of each keyword t of said M kinds of keywords appearing in said document database D;

document profile vector producing means (110) for producing an M-dimensional document profile vector P_d
having components respectively representing a relative frequency-of-occurrence p_dt of each keyword t in the
concerned document d;

keyword profile vector producing means (108) for producing an N-dimensional keyword profile vector Q_t having
components respectively representing a relative frequency-of-occurrence q_td of the concerned keyword t in
each document d;

document principal component analyzing means (114) for performing a principal component analysis on a
document profile vector group of a document group in said document database D and for obtaining a predefined
(K)-dimensional document feature vector U_d corresponding to said document profile vector P_d for said each
document d as well as for obtaining a document contribution factor (i.e., eigenvalue of a correlation matrix) λ_j
of each dimension j;

keyword principal component analyzing means (112) for performing a principal component analysis on a
keyword profile vector group of a keyword group in said document database D and for obtaining a predefined
(K)-dimensional keyword feature vector V_t corresponding to said keyword profile vector Q_t for said each keyword
t, said keyword feature vector having the same dimension as that of said document feature vector;

extracting condition feature vector calculating means (118) for receiving said extracting condition consisting
of the document group including at least one document x_1, ..., x_r and for calculating an extracting condition
feature vector corresponding to said extracting condition based on said document length data of the received
document group, said document feature vector and said document contribution factor; and

relevant keyword extracting means (121, 122) for calculating a relevancy between each keyword t and said
extracting condition based on the calculated extracting condition feature vector and a keyword feature vector
of each keyword t, and outputting a designated number of relevant keywords in order of the calculated
relevancy.

11. The relevant keyword extracting apparatus in accordance with claim 10, wherein said relevant keyword extracting
means (121) calculates the relevancy between each keyword t and said extracting condition based on an inner
product between said extracting condition feature vector and said keyword feature vector of each keyword t.

12. The relevant keyword extracting apparatus in accordance with claim 10, wherein said relevant keyword extracting
means (122) calculates the relevancy between each keyword t and said extracting condition based on a distance
between said extracting condition feature vector and said keyword feature vector of each keyword t.

13. The similar document retrieving apparatus in accordance with claim 1 or claim 4 or the relevant keyword extracting
apparatus in accordance with claim 10, wherein said document principal component analyzing means (114)
calculates the inner product between two document profile vectors P_a and P_b of two documents a and b contained
in the document database D by using a product-sum of weighted components reflecting said keyword weight data
W and a degree of dispersion (i.e., an evaluation value of standard deviation) of components p_at and p_bt of said
document profile vectors P_a and P_b, and performs said principal component analysis on the assumption that the
document profile vectors of said document d of the document length l_d are contained in the document profile vector
group by the number proportional to a ratio g_d/l_d, where g_d represents the total number of all keywords appearing
in the document d and l_d represents the document length of the document d.

14. The similar document retrieving apparatus in accordance with claim 4 or the relevant keyword extracting apparatus
in accordance with claim 7 or claim 10, wherein said keyword principal component analyzing means (112)
calculates the inner product between two keyword profile vectors Q_a and Q_b of two keywords K_a and K_b contained in
the document database D by using a product-sum of weighted components reflecting said document length data
L and a degree of dispersion (i.e., an evaluation value of standard deviation) of components q_ad and q_bd of said
keyword profile vectors Q_a and Q_b, and performs said principal component analysis on the assumption that the
keyword profile vectors of said keyword t of the keyword weight w_t are contained in the keyword profile vector
group by the number proportional to a ratio h_t/w_t, where h_t represents an overall frequency-of-occurrence value of
keyword t and w_t represents the keyword weight of the keyword t.

15. The similar document retrieving apparatus in accordance with claim 1 or claim 4 or the relevant keyword extracting
apparatus in accordance with claim 7 or claim 10, wherein said document length calculating means (104) compares
a character number of the concerned document d with a predetermined threshold l_0 and stores l_0 as the length of
said concerned document d when the character number of the concerned document d is less than l_0 and stores a
δ-th (δ is a nonnegative integer) root of said character number when the character number of the concerned
document d is equal to or larger than l_0.

16. The similar document retrieving apparatus in accordance with claim 1 or claim 4 or the relevant keyword extracting
apparatus in accordance with claim 7 or claim 10, wherein said document length calculating means (104) compares
a total number of keywords appearing in the concerned document d with a predetermined threshold l_0 and stores
l_0 as the length of said concerned document d when the total number of keywords is less than l_0 and stores a
δ-th (δ is a nonnegative integer) root of said total number of keywords when the total number of keywords
is equal to or larger than l_0.
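
The thresholded document length described in the two preceding claims can be sketched as follows. This follows the claim wording literally and is not a definitive implementation: `count` stands for either the character number or the total number of keywords, the function name `document_length` is hypothetical, and δ is assumed to be at least 1 so that the root is defined.

```python
def document_length(count, l0, delta):
    """Document length following the literal wording of the claims.

    count: character number or total number of keywords of the document.
    l0:    predetermined threshold.
    delta: order of the root (assumed >= 1 here so the root is defined).
    """
    if delta < 1:
        raise ValueError("delta must be at least 1 in this sketch")
    if count < l0:
        return float(l0)                    # short documents get the threshold
    return float(count) ** (1.0 / delta)    # otherwise the delta-th root

short_doc = document_length(50, l0=100, delta=1)     # below the threshold
long_doc = document_length(10000, l0=100, delta=2)   # square root of the count
```

With δ = 1 the rule reduces to taking the larger of the count and the threshold; a larger δ damps the influence of very long documents.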

17. The similar document retrieving apparatus in accordance with claim 1 or claim 4 or the relevant keyword extracting
apparatus in accordance with claim 7 or claim 10, wherein said keyword weight calculating means (106) calculates
the weight w_t of the concerned keyword t according to the following formula

w_t = 1 + log2((N + ε)/(n + ε))

where N represents the number of all documents, ε represents a constant, and n represents the number of
documents involving the concerned keyword t.

18. The similar document retrieving apparatus in accordance with claim 1 or claim 4 or the relevant keyword extracting
apparatus in accordance with claim 7 or claim 10, wherein said keyword weight calculating means (106) calculates
the weight w_t of the concerned keyword t according to the following formula

w_t = log2((N + ε + 1)/(n + ε))

where N represents the number of all documents, ε represents a constant, and n represents the number of
documents involving the concerned keyword t.
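
The two keyword weight formulas above can be evaluated directly. The sketch below is a straightforward transcription; the function names `weight_claim17` and `weight_claim18` are labels chosen for this example, and `eps` stands for the constant ε.

```python
import math

def weight_claim17(N, n, eps):
    """w_t = 1 + log2((N + eps) / (n + eps))."""
    return 1.0 + math.log2((N + eps) / (n + eps))

def weight_claim18(N, n, eps):
    """w_t = log2((N + eps + 1) / (n + eps))."""
    return math.log2((N + eps + 1.0) / (n + eps))

# A keyword found in 2 of 8 documents, with eps = 0.
w17 = weight_claim17(8, 2, 0.0)   # 1 + log2(4)
w18 = weight_claim18(7, 2, 0.0)   # log2(8 / 2)
```

Both formulas give a rare keyword a larger weight than a common one, and both stay positive even for a keyword that appears in every document (n = N).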

19. The similar document retrieving apparatus in accordance with claim 1 or claim 4 or the relevant keyword extracting
apparatus in accordance with claim 10, wherein said document profile vector producing means (110) calculates
the relative frequency-of-occurrence p_dt of each keyword t in the concerned document d by dividing the
frequency-of-occurrence f_dt of each keyword t in the concerned document d by a sum Σ_j f_dj of the frequency-of-occurrence
values of all keywords j appearing in the concerned document d.

20. The similar document retrieving apparatus in accordance with claim 4 or the relevant keyword extracting apparatus
in accordance with claim 7 or claim 10, wherein said keyword profile vector producing means (108) calculates the
relative frequency-of-occurrence q_td of the concerned keyword t in each document d by dividing the
frequency-of-occurrence f_dt of the concerned keyword t in said each document d by a sum Σ_j f_jt of the frequency-of-occurrence
values of the concerned keyword t in all documents j containing the concerned keyword t.
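
The two normalizations described in the preceding claims amount to dividing the frequency data by its row sums (document profile vectors) and by its column sums (keyword profile vectors). A minimal sketch, assuming the frequency data is held as an N x M array indexed as F[d, t]; the function name `profile_vectors` is hypothetical.

```python
import numpy as np

def profile_vectors(F):
    """Return (P, Q): document profiles P[d, t] and keyword profiles Q[t, d].

    F[d, t] is the frequency-of-occurrence of keyword t in document d.
    Rows or columns that sum to zero are left as zero vectors.
    """
    F = np.asarray(F, dtype=float)
    row = F.sum(axis=1, keepdims=True)   # total keyword count per document
    col = F.sum(axis=0, keepdims=True)   # total count of each keyword
    P = np.divide(F, row, out=np.zeros_like(F), where=row > 0)
    Q = np.divide(F, col, out=np.zeros_like(F), where=col > 0).T
    return P, Q

# Two documents over three keywords.
P, Q = profile_vectors([[2, 0, 2], [1, 3, 0]])
```

Each row of P and each row of Q sums to one, so the profiles are independent of the raw document size and of the raw keyword frequency respectively.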


21. The similar document retrieving apparatus or the relevant keyword extracting apparatus in accordance with claim
13, wherein said document principal component analyzing means (114) obtains said document feature vector on
the assumption that the degree of dispersion of the component p_dt corresponding to the keyword t of the document
profile vector P_d of each document d in the document database D is expressed by a square root of h_t/f, where h_t
represents an overall frequency-of-occurrence value of the keyword t and f represents a sum of frequency-of-occurrence
values of all keywords.

22. The similar document retrieving apparatus or the relevant keyword extracting apparatus in accordance with claim
13, wherein said document principal component analyzing means (114) calculates the inner product between two
document profile vectors P_a and P_b of two documents a and b contained in the document database D by dividing
each of the components p_at and p_bt corresponding to the keyword t of the document profile vectors P_a and P_b by
the degree of dispersion of the respective components, then multiplying the resulting quotients together,
then multiplying the resultant value by the keyword weight data w_t, and then obtaining a sum of the thus
weighted value for all of the keywords t.
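
The weighted inner product between document profile vectors can be sketched as below. The sketch assumes that the degree of dispersion of the component for keyword t is the square root of h_t/f (with h_t the overall frequency of keyword t and f the sum over all keywords), that every document and every keyword has a nonzero total, and that the frequency data is an array F[d, t]. The function name `doc_inner_product` is hypothetical.

```python
import numpy as np

def doc_inner_product(F, w, a, b):
    """Weighted inner product of the profile vectors of documents a and b.

    Computes sum_t w_t * (p_at / s_t) * (p_bt / s_t), s_t = sqrt(h_t / f).
    Assumption: no document row and no keyword column of F sums to zero.
    """
    F = np.asarray(F, dtype=float)
    w = np.asarray(w, dtype=float)
    P = F / F.sum(axis=1, keepdims=True)   # document profile vectors
    h = F.sum(axis=0)                      # overall frequency of each keyword
    f = F.sum()                            # sum over all keywords
    s = np.sqrt(h / f)                     # assumed degree of dispersion
    return float(np.sum(w * (P[a] / s) * (P[b] / s)))

F = [[2, 0, 2], [1, 3, 0]]
value = doc_inner_product(F, w=[1.0, 1.0, 1.0], a=0, b=1)
```

Dividing by the dispersion keeps frequent keywords from dominating the product, and the keyword weight then scales each keyword's contribution.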


23. The similar document retrieving apparatus or the relevant keyword extracting apparatus in accordance with claim
14, wherein said keyword principal component analyzing means (112) obtains said keyword feature vector on the
assumption that the degree of dispersion of the component q_td corresponding to the document d of the keyword
profile vector Q_t of each keyword t in the document database D is expressed by a square root of g_d/f, where g_d
represents the total number of all keywords appearing in the document d and f represents a sum of frequency-of-occurrence
values of all keywords.

24. The similar document retrieving apparatus or the relevant keyword extracting apparatus in accordance with claim
14, wherein said keyword principal component analyzing means (112) calculates the inner product between two
keyword profile vectors Q_a and Q_b of two keywords K_a and K_b contained in the document database D by dividing
each of the components q_ad and q_bd corresponding to the document d of the keyword profile vectors Q_a and Q_b
by the degree of dispersion of the respective components, then multiplying the resulting quotients together,
then dividing the resultant value by the document length l_d, and then obtaining a sum of the thus weighted
value for all of the documents d.
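
The corresponding inner product between keyword profile vectors can be sketched in the same way. The sketch assumes that the degree of dispersion of the component for document d is the square root of g_d/f (with g_d the total number of keywords in document d and f the sum over all keywords), that F[d, t] has no all-zero row or column, and that `lengths` holds the document lengths l_d. The function name `keyword_inner_product` is hypothetical.

```python
import numpy as np

def keyword_inner_product(F, lengths, a, b):
    """Weighted inner product of the profile vectors of keywords a and b.

    Computes sum_d (q_ad / s_d) * (q_bd / s_d) / l_d, s_d = sqrt(g_d / f).
    Assumption: no document row and no keyword column of F sums to zero.
    """
    F = np.asarray(F, dtype=float)
    l = np.asarray(lengths, dtype=float)
    Q = (F / F.sum(axis=0, keepdims=True)).T   # keyword profile vectors Q[t, d]
    g = F.sum(axis=1)                          # keyword total per document
    f = F.sum()                                # sum over all keywords
    s = np.sqrt(g / f)                         # assumed degree of dispersion
    return float(np.sum((Q[a] / s) * (Q[b] / s) / l))

F = [[2, 0, 2], [1, 3, 0]]
value = keyword_inner_product(F, lengths=[1.0, 1.0], a=0, b=1)
```

Dividing by the document length reduces the contribution of long documents, mirroring the role the keyword weight plays in the document-side inner product.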

FIG. 4